Today, I would like to introduce “Graph Attention Networks” (GAT) (1). I like it very much because it uses attention mechanisms. Let us see how it works.
1. Which nodes should we pay more attention to?
As I said before, a GNN is updated by taking information from neighboring nodes. But you may be wondering which information is more important than the rest. In other words, which nodes should we pay more attention to? As the chart shows, the information from some nodes matters more than that from others. The thicker the red arrow from a sender node to the receiver node is, the more attention GAT should pay to that sender. But how can we know which nodes deserve more attention?
2. Attention mechanism
Some of you may not know about “attention mechanisms”, so let me explain them in detail. Attention became popular when the Natural Language Processing (NLP) model called “Transformer” introduced this mechanism in 2017. That model can understand which words are more important than others when it considers one specific word in a sentence. GAT introduces the same mechanism to answer the question “which nodes should GAT pay more attention to when information is gathered from neighbors?”. The charts below explain how the attention mechanism works. They are taken from the original research paper on GAT (1).
In order to understand which nodes GAT should pay more attention to, attention weights (red arrows) are needed. The bigger these weights are, the more attention GAT pays. To calculate them, the features of the sender node (green arrow) and the receiver node (blue arrow) are first linearly transformed and concatenated. Then eij is computed by a single-layer neural network (formula 1). This is called “self-attention”, and eij is called the “attention coefficient”. Once the attention coefficients of all sender nodes are obtained, we put them into the softmax function to normalize them (formula 2), which yields the “attention weights” aij (right illustration). If you want to know more, please check formula 3. Finally, the receiver node can be updated based on the attention weights aij (formula 4).
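The steps above can be sketched in NumPy. This is a minimal illustration, not the paper’s implementation: `gat_attention`, the weight matrix `W`, and the attention vector `a` are hypothetical names I chose, and the final nonlinearity is omitted for clarity.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    # the paper applies LeakyReLU to the attention coefficients
    return np.where(x > 0, x, negative_slope * x)

def gat_attention(h, W, a, neighbors, i):
    """One attention head for receiver node i (a sketch).
    h: (N, F) node features, W: (F, F') shared linear transform,
    a: (2*F',) attention vector, neighbors: sender node indices."""
    z = h @ W                                        # linear transformation
    # attention coefficients e_ij: single-layer NN on [z_i || z_j] (formula 1)
    e = np.array([leaky_relu(a @ np.concatenate([z[i], z[j]]))
                  for j in neighbors])
    # softmax over the neighborhood gives attention weights a_ij (formula 2)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()
    # update the receiver node as the weighted sum of senders (formula 4)
    h_new = sum(w * z[j] for w, j in zip(alpha, neighbors))
    return alpha, h_new
```

Because of the softmax, the attention weights over a neighborhood always sum to 1, so they can be read directly as “how much of the update comes from each sender”.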
3. Multi-head attention
GAT introduces multi-head attention, which means that GAT runs several attention mechanisms in parallel (right illustration). K attention mechanisms execute the transformation of formula 4, and then their output features are concatenated (formula 5). When multi-head attention is performed on the final layer of the network, GAT averages the results from each attention head instead of concatenating them, and delays applying the softmax for the classification task (formula 6).
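A rough sketch of the two combination rules, again with hypothetical names (`head`, `multi_head`) and with the nonlinearity omitted; each head has its own weight matrix and attention vector:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def head(h, W, a, neighbors, i):
    # one attention head: coefficients, softmax, weighted sum (formulas 1-4)
    z = h @ W
    e = np.array([leaky_relu(a @ np.concatenate([z[i], z[j]]))
                  for j in neighbors])
    w = np.exp(e - e.max())
    w = w / w.sum()
    return sum(wj * z[j] for wj, j in zip(w, neighbors))

def multi_head(h, Ws, As, neighbors, i, final_layer=False):
    """K heads: concatenate on hidden layers (formula 5),
    average on the final layer (formula 6)."""
    outs = [head(h, Wk, ak, neighbors, i) for Wk, ak in zip(Ws, As)]
    if final_layer:
        return np.mean(outs, axis=0)   # average K head outputs
    return np.concatenate(outs)        # concatenate K head outputs
```

Note the shape difference: with K heads of size F', a hidden layer outputs a vector of size K*F', while the final layer outputs size F'.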
I hope you enjoyed the article. I like GAT because it is easy to use and more accurate than the other GNNs I explained before. I will update this article soon. Stay tuned!
(1) “Graph Attention Networks”, Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio, ICLR 2018.
Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.