With 2018 drawing to a close, I would like to talk about one of the biggest innovations in deep learning this year: BERT, presented by Google AI.
I would like to focus on the structure of the attention mechanism, as I believe it is the most important part of BERT. These are the formulas of the attention mechanism, but it is a little difficult to understand how it works just by looking at them.
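For reference, the scaled dot-product attention used in BERT's Transformer layers can be written as follows (here $d_k$ is the dimension of the key vectors):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```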
So let me explain them step by step using charts.

1. Create Query, Key and Value vectors from tokens
First, we obtain the embedding vectors as input. You can think of each one as representing a word (this is not strictly accurate, but I want to keep it simple).
Second, we create the weight matrices for the query (Q), key (K), and value (V). These are the key components of the attention mechanism in BERT. With these matrices, we can create Q, K, and V from the embedding vectors.
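As a minimal sketch of this step, here is how Q, K, and V could be computed with NumPy. The sizes (4 tokens, 768-dim embeddings, 64-dim heads as in one BERT-base attention head) and the random matrices are illustrative assumptions; in a real model the weight matrices are learned.

```python
import numpy as np

np.random.seed(0)

# Illustrative sizes: 4 tokens, 768-dim embeddings,
# 64 dims per attention head (as in BERT-base).
seq_len, d_model, d_k = 4, 768, 64

X = np.random.randn(seq_len, d_model)       # token embedding vectors (input)

# Weight matrices for query, key, and value.
# Random here for illustration; learned during training in practice.
W_q = np.random.randn(d_model, d_k) * 0.02
W_k = np.random.randn(d_model, d_k) * 0.02
W_v = np.random.randn(d_model, d_k) * 0.02

# Each token's Q, K, V is a linear projection of its embedding.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)   # (4, 64) (4, 64) (4, 64)
```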
Third, once we obtain Q, K, and V, we want to know how much each word should pay attention to the other words. The dot product can be used to measure the relationship between one word and another: the larger the dot product between a query and a key, the more attention that pair receives.
Finally, the output of BERT-base has 768 dimensions for each token.
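One way to see where the 768 comes from: BERT-base uses 12 attention heads of 64 dimensions each, and their outputs are concatenated (12 × 64 = 768). A small sketch with placeholder random head outputs:

```python
import numpy as np

seq_len, num_heads, d_head = 4, 12, 64   # BERT-base: 12 heads, 64 dims each

# Placeholder per-head outputs; in the real model these come from attention.
head_outputs = [np.random.randn(seq_len, d_head) for _ in range(num_heads)]

# Concatenating the heads gives the 768-dim representation per token.
concat = np.concatenate(head_outputs, axis=-1)
print(concat.shape)   # (4, 768)
```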
Because we are surrounded by a massive amount of documentation such as contracts, financial reports, and regulatory instructions, it is impossible to read everything and extract the information we need in real time. With BERT, we might build much better applications to handle large volumes of text and extract the information we need. It is exciting to think about how many applications could be created using BERT in the near future.