Really well written, and nicely paced! The illustrations are wonderfully intuitive!
I would recommend this blog to everyone as an introduction to GNNs.
Note that it is from 2021, so some newer techniques and directions are not covered.
A Gentle Introduction to Graph Neural Networks
Graphs have up to four types of information that we will potentially want to use to make predictions: nodes, edges, global-context and connectivity
Graph representation
we need to represent: nodes, edges, global-context and connectivity
representing a graph’s connectivity is more complicated.
Perhaps the most obvious choice would be to use an adjacency matrix, since this is easily tensorisable.
However, this representation has a few drawbacks.
From the example dataset table, we see the number of nodes in a graph can be on the order of millions, and the number of edges per node can be highly variable. Often, this leads to very sparse adjacency matrices, which are space-inefficient.
Another problem is that there are many adjacency matrices that can encode the same connectivity, and there is no guarantee that these different matrices would produce the same result in a deep neural network (that is to say, they are not permutation invariant).
permutation invariant: If we permute the nodes in some way, the resulting representations of the nodes as computed by our algorithms should also be permuted in the same way.
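As a small illustration of this point (a hypothetical numpy sketch, not from the original article): two adjacency matrices related by a node relabeling describe the same graph, so any prediction we compute from them should agree.

```python
import numpy as np

# Adjacency matrix of a 3-node star: node 0 connected to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

# Relabel the nodes with a permutation matrix P (here: swap nodes 0 and 2).
P = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
A_permuted = P @ A @ P.T   # same graph, different adjacency matrix

print(np.array_equal(A, A_permuted))        # False: the raw matrices differ
print(A.sum() // 2, A_permuted.sum() // 2)  # 2 2: a graph-level quantity (edge count) agrees
```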
nodes: node embeddings (information/attributes/representations of nodes; node vectors)
edges: edge embeddings
adjacency lists: These describe the connectivity of edge e_k between nodes n_i and n_j as a tuple (i, j) in the k-th entry of an adjacency list.
global: global embeddings
On one side we have a small graph and on the other the information of the graph in a tensor representation.
It should be noted that the figure uses scalar values per node/edge/global, but most practical tensor representations have vectors per graph attribute. Instead of a node tensor of size [n_nodes] we will be dealing with node tensors of size [n_nodes, node_dim]. Same for the other graph attributes.
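A minimal sketch of this tensor representation in Python/numpy (the dictionary keys and sizes here are illustrative, not from the article):

```python
import numpy as np

# A toy graph with 4 nodes and 3 edges.
graph = {
    "nodes": np.random.randn(4, 8),       # [n_nodes, node_dim] node embeddings
    "edges": np.random.randn(3, 4),       # [n_edges, edge_dim] edge embeddings
    "global": np.random.randn(6),         # [global_dim] global-context embedding
    # adjacency list: the k-th entry (i, j) says edge k connects nodes i and j
    "adjacency_list": np.array([[0, 1], [1, 2], [2, 3]]),
}
```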
Now that the graph's description is in a matrix format that is permutation invariant, we can describe how to use graph neural networks (GNNs) to solve graph prediction tasks.
What’s a GNN
A GNN is an optimizable transformation on all attributes of the graph (nodes, edges, global-context) that preserves graph symmetries (permutation invariances).
The simplest GNN
With the numerical representation of graphs that we’ve constructed above (with vectors instead of scalars), we are now ready to build a GNN.
We will start with the simplest GNN architecture, one where we learn new embeddings for all graph attributes (nodes, edges, global), but where we do NOT yet use the connectivity of the graph.
This GNN uses a separate multilayer perceptron (MLP) (or your favorite differentiable model) on **each component** (V, E, and U; three components in total) of a graph; we call this a GNN layer. For each node vector, we apply the MLP and get back a learned node vector. We do the same for each edge, learning a per-edge embedding, and also for the global-context vector, learning a single embedding for the entire graph.
- It is reasonable that every node vector, every edge vector, and the global vector each pass through an MLP: recall that in an ordinary neural network, every input vector is likewise passed through an MLP.
- This simplest GNN essentially applies an ordinary neural network directly to graph data: for example, to make predictions on node vectors, we train a neural network on the node-vector training data. The problem (taking node vectors as the example) is that node vectors are not independent of one another; they are connected by edges (unlike an ordinary neural network, which assumes input samples are independent of each other). The message-passing GNN layer introduced later accounts for this **topological correlation**: during training, each node vector does not pass through the network independently; instead, before entering the network, it is first aggregated with the vectors of its topological neighbors, thereby exploiting the relational information in the topology.
a GNN layer: You could also call it a GNN block, because it contains multiple operations/layers (like a ResNet block).
A single layer of a simple GNN. A graph is the input, and each component (V, E, U) gets updated by an MLP (f in this figure) to produce a new graph. Each function subscript indicates a separate function for a different graph attribute at the n-th layer of a GNN model.
As is common with neural network modules or layers, we can stack these GNN layers together.
Because a GNN does not update the connectivity of the input graph, we can describe the output graph of a GNN with the same adjacency list and the same number of feature vectors as the input graph. But, the output graph has updated embeddings, since the GNN has updated each of the node, edge and global-context representations.
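A minimal sketch of such a graph-independent GNN layer (my own illustration; single dense layers with ReLU stand in for the per-attribute MLPs, and the connectivity is simply carried through unchanged):

```python
import numpy as np

graph = {"nodes": np.random.randn(4, 8), "edges": np.random.randn(3, 4),
         "global": np.random.randn(6),
         "adjacency_list": np.array([[0, 1], [1, 2], [2, 3]])}

def dense(x, w, b):
    """One dense layer with ReLU, standing in for 'your favorite MLP'."""
    return np.maximum(0, x @ w + b)

def simplest_gnn_layer(graph, params):
    """Update nodes, edges and global independently; connectivity is not used."""
    return {
        "nodes": dense(graph["nodes"], *params["f_V"]),
        "edges": dense(graph["edges"], *params["f_E"]),
        "global": dense(graph["global"], *params["f_U"]),
        "adjacency_list": graph["adjacency_list"],  # unchanged by the layer
    }

# Toy parameters: one (weight, bias) pair per graph attribute.
params = {attr: (np.random.randn(d, d) * 0.1, np.zeros(d))
          for attr, d in [("f_V", 8), ("f_E", 4), ("f_U", 6)]}
new_graph = simplest_gnn_layer(graph, params)
```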
GNN Predictions by Pooling Information
We have built a simple GNN, but how do we make predictions in any of the tasks we described above?
We will consider the case of binary classification.
If the task is to make binary predictions on nodes, and the graph already contains node information, the approach is straightforward — for each node embedding, apply a linear classifier.
However, it is not always so simple. For instance, you might have information in the graph stored in edges, but no information in nodes, and yet still need to make predictions on nodes.
e.g., We could imagine a social network, where we wish to anonymize user data (nodes) by not using them, and only using relational data (edges).
We need a way to collect information from edges and give them to nodes (i.e., convert edge embeddings into node embeddings) for prediction. We can do this by pooling. Pooling proceeds in two steps:
- For each item to be pooled (e.g., all the edges incident to a node), gather each of their embeddings and concatenate them into a matrix.
- The gathered embeddings are then aggregated, usually via a sum operation.
We represent the pooling operation by the letter ρ, and denote that we are gathering information from edges to nodes as ρ_{E_n → V_n}.
Hover over a node to visualize which edges are gathered and aggregated to produce an embedding for that target node.
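A minimal sketch of ρ_{E_n → V_n} followed by a linear binary classifier on each node (the shapes and names here are illustrative; sum is used as the aggregation, as in the text):

```python
import numpy as np

n_nodes, edge_dim = 4, 4
edges = np.random.randn(3, edge_dim)                 # only edge-level features exist
adjacency_list = np.array([[0, 1], [1, 2], [2, 3]])  # edge k connects nodes (i, j)

def pool_edges_to_nodes(edges, adjacency_list, n_nodes):
    """rho_{E->V}: sum the embeddings of all edges incident to each node."""
    pooled = np.zeros((n_nodes, edges.shape[1]))
    for k, (i, j) in enumerate(adjacency_list):
        pooled[i] += edges[k]
        pooled[j] += edges[k]
    return pooled

node_inputs = pool_edges_to_nodes(edges, adjacency_list, n_nodes)

# Linear classifier + sigmoid gives a binary prediction per node.
w, b = np.random.randn(edge_dim), 0.0
node_probs = 1.0 / (1.0 + np.exp(-(node_inputs @ w + b)))
```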
So if we only have edge-level features, and are trying to predict binary node information, we can use pooling to route (or pass) information to where it needs to go. The model looks like this.
If we only have node-level features, and are trying to predict binary edge-level information, the model looks like this.
e.g., Nodes can be recognized as image entities, and we are trying to predict if the entities share a relationship (binary edges).
If we only have node-level features, and need to predict a binary global property, we need to gather all available node information together and aggregate them. This is similar to Global Average Pooling layers in CNNs. The same can be done for edges.
e.g., This is a common scenario for predicting molecular properties. For example, we have atomic information, connectivity and we would like to know the toxicity of a molecule (toxic/not toxic), or if it has a particular odor (rose/not rose).
In our examples, the classification model c can easily be replaced with any differentiable model, or adapted to multi-class classification using a generalized linear model.
An end-to-end prediction task with a GNN model.
Now we’ve demonstrated that we can build a simple GNN model, and make binary predictions by routing information between different parts of the graph. This pooling technique will serve as a building block for constructing more sophisticated GNN models. If we have new graph attributes, we just have to define how to pass information from one attribute to another.
Note that in this simplest GNN formulation, we're not using the connectivity of the graph at all inside the GNN layer. Each node is processed independently, as is each edge, as well as the global context (each vector simply passes through its own MLP on its own). We only use connectivity when pooling information for prediction.
Passing messages between parts of the graph
We could make more sophisticated predictions by using pooling within the GNN layer, in order to make our learned embeddings aware of graph connectivity. We can do this using message passing[18], where neighboring nodes or edges exchange information and influence each other’s updated embeddings.
Message passing works in three steps:
- For each node in the graph, gather all the neighboring node embeddings (or messages), which is the g function described above.
- Aggregate all messages via an aggregate function (like sum).
- All pooled messages are passed through an update function, usually a learned neural network.
You could also 1) gather messages, 3) update them and 2) aggregate them and still have a permutation invariant operation.[20]
Just as pooling can be applied to either nodes or edges, message passing can occur between either nodes or edges.
These steps are key for leveraging the connectivity of graphs. We will build more elaborate variants of message passing in GNN layers that yield GNN models of increasing expressiveness and power.
This sequence of operations, when applied once, is the simplest type of message-passing GNN layer. (Compared with the simplest GNN layer, a message-passing/"pooling" step (gather-aggregate) is added before the function f is applied to each vector.)
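A minimal sketch of this gather / aggregate / update sequence for node embeddings (my own version; sum aggregation, and a single dense layer with ReLU standing in for the learned update function):

```python
import numpy as np

nodes = np.random.randn(4, 8)                        # [n_nodes, node_dim]
adjacency_list = np.array([[0, 1], [1, 2], [2, 3]])  # undirected edges (i, j)

def message_passing_layer(nodes, adjacency_list, w, b):
    # 1) gather and 2) aggregate: sum the neighboring node embeddings (messages).
    messages = np.zeros_like(nodes)
    for i, j in adjacency_list:
        messages[i] += nodes[j]
        messages[j] += nodes[i]
    # 3) update: pass [own embedding, pooled messages] through a learned function.
    return np.maximum(0, np.concatenate([nodes, messages], axis=1) @ w + b)

w = np.random.randn(16, 8) * 0.1   # maps node_dim * 2 -> node_dim
b = np.zeros(8)
new_nodes = message_passing_layer(nodes, adjacency_list, w, b)
```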
This is reminiscent of standard convolution: in essence, message passing and convolution are operations to aggregate and process the information of an element’s neighbors in order to update the element’s value. In graphs, the element is a node, and in images, the element is a pixel. However, the number of neighboring nodes in a graph can be variable, unlike in an image where each pixel has a set number of neighboring elements.
By stacking message passing GNN layers together, a node can eventually incorporate information from across the entire graph: after three layers, a node has information about the nodes three steps away from it.
We can update our architecture diagram to include this new source of information for nodes:
Schematic for a GCN architecture, which updates node representations of a graph by pooling neighboring nodes at a distance of one degree.
Learning edge representations
Our dataset does not always contain all types of information (node, edge, and global context). When we want to make a prediction on nodes, but our dataset only has edge information, we showed above how to use pooling to route information from edges to nodes, but only at the final prediction step of the model. We can share information between nodes and edges within the GNN layer using message passing.
However, the node and edge information stored in a graph are not necessarily the same size or shape, so it is not immediately clear how to combine them. One way is to learn a linear mapping from the space of edges to the space of nodes, and vice versa. Alternatively, one may concatenate them together before the update function.
Architecture schematic for a message-passing layer. Before the embeddings are fed to the ordinary neural network, the following step is added so that the node/edge vectors can make use of the topology shown above (rather than only the information in each vector itself): the first step (ρ_{V_n → E_n}) "prepares" a message composed of information from an edge and its connected nodes, and then the second step (ρ_{E_n → V_n}; when updating a node vector, the incident-edge vectors it uses are the messages aggregated in the previous step, ρ_{V_n → E_n}) "passes" the message to the node.
Which graph attributes we update and in which order we update them is one design decision when constructing GNNs. We could choose whether to update node embeddings before edge embeddings, or the other way around. This is an open area of research with a variety of solutions– for example we could update in a ‘weave’ fashion[21] where we have four updated representations that get combined into new node and edge representations: node to node (linear), edge to edge (linear), node to edge (edge layer), edge to node (node layer).
Some of the different ways we might combine edge and node representation in a GNN layer.
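A minimal sketch of one such ordering (first ρ_{V_n → E_n}, then ρ_{E_n → V_n}), using learned linear maps to reconcile the different node and edge dimensionalities; the names and update rules here are illustrative only:

```python
import numpy as np

node_dim, edge_dim = 8, 4
nodes = np.random.randn(4, node_dim)
edges = np.random.randn(3, edge_dim)
adjacency_list = np.array([[0, 1], [1, 2], [2, 3]])

# Learned linear maps between the node space and the edge space.
W_node_to_edge = np.random.randn(node_dim, edge_dim) * 0.1
W_edge_to_node = np.random.randn(edge_dim, node_dim) * 0.1

# Step 1 (rho_{V->E}): each edge gathers its two endpoint nodes, mapped to edge space.
new_edges = edges.copy()
for k, (i, j) in enumerate(adjacency_list):
    new_edges[k] += (nodes[i] + nodes[j]) @ W_node_to_edge

# Step 2 (rho_{E->V}): each node gathers its updated incident edges, mapped to node space.
new_nodes = nodes.copy()
for k, (i, j) in enumerate(adjacency_list):
    new_nodes[i] += new_edges[k] @ W_edge_to_node
    new_nodes[j] += new_edges[k] @ W_edge_to_node
```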
Adding global representations
There is one flaw with the networks we have described so far: nodes that are far away from each other in the graph may never be able to efficiently transfer information to one another, even if we apply message passing several times. For one node, if we have k layers, information will propagate at most k steps away. This can be a problem for situations where the prediction task depends on nodes, or groups of nodes, that are far apart.
One solution would be to have all nodes be able to pass information to each other (i.e., every pair of nodes is connected by an "edge"; in the node-to-node fashion, for example, computing a node vector in one GNN layer first aggregates the current node's vector with the vectors of all other nodes in the graph, and then feeds the result into the neural network). Unfortunately for large graphs, this quickly becomes computationally expensive (although this approach, called 'virtual edges', has been used for small graphs such as molecules).[18]
One solution to this problem is to use the global representation of a graph (U), which is sometimes called a master node [19][18] or context vector. This global context vector is connected to all other nodes and edges in the network, and can act as a bridge between them to pass information, building up a representation for the graph as a whole. This creates a richer and more complex representation of the graph than could have otherwise been learned.
Schematic of a Graph Nets architecture leveraging global representations. U is the global representation; since it is connected to all nodes and all edges, it plays the role of both a node and an edge: in the figure above, in the first step (V_n → E_n and U_n → E_n), U_n also acts as a V_n; the same holds for the second and third steps.
In this view all graph attributes have learned representations, so we can leverage them during pooling by conditioning the information of our attribute of interest with respect to the rest. For example, for one node we can consider information from neighboring nodes, connected edges and the global information. To condition the new node embedding on all these possible sources of information, we can simply concatenate them. Additionally we may also map them to the same space via a linear map and add them, or apply a feature-wise modulation layer[22] (here "feature-wise" means that a different layer is applied to each kind of information: one layer for node information, another for edge information, and yet another for global information), which can be considered a type of feature-wise attention mechanism.
Schematic for conditioning the information of one node based on three other embeddings (adjacent nodes, adjacent edges, global). This step corresponds to the node operations in the Graph Nets Layer.
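A minimal sketch of conditioning one node update on all three sources by concatenation (illustrative shapes only; the pooled neighbor and edge vectors are assumed to come from steps like the earlier sketches, and the global vector is broadcast to every node):

```python
import numpy as np

n_nodes, node_dim, edge_dim, global_dim = 4, 8, 4, 6
nodes = np.random.randn(n_nodes, node_dim)
pooled_neighbor_nodes = np.random.randn(n_nodes, node_dim)  # from rho_{V->V}
pooled_incident_edges = np.random.randn(n_nodes, edge_dim)  # from rho_{E->V}
u = np.random.randn(global_dim)                             # global-context vector

# Concatenate [self, pooled nodes, pooled edges, broadcast global] for each node,
# then apply the node update function (a dense layer with ReLU here).
u_broadcast = np.tile(u, (n_nodes, 1))
update_input = np.concatenate(
    [nodes, pooled_neighbor_nodes, pooled_incident_edges, u_broadcast], axis=1)
W = np.random.randn(update_input.shape[1], node_dim) * 0.1
new_nodes = np.maximum(0, update_input @ W)
```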
GNN playground
We’ve described a wide range of GNN components here, but how do they actually differ in practice? This GNN playground allows you to see how these different components and architectures contribute to a GNN’s ability to learn a real task.
Our playground shows a graph-level prediction task with small molecular graphs. We use the Leffingwell Odor Dataset[23][24], which is composed of molecules with associated odor percepts (labels). Predicting the relation of a molecular structure (graph) to its smell is a 100-year-old problem straddling chemistry, physics, neuroscience, and machine learning.
To simplify the problem, we consider only a single binary label per molecule, classifying if a molecular graph smells “pungent” or not, as labeled by a professional perfumer.
We say a molecule has a “pungent” scent if it has a strong, striking smell. For example, garlic and mustard, which may contain the molecule allyl alcohol, have this quality. The molecule piperitone, often used for peppermint-flavored candy, is also described as having a pungent smell.
We represent each molecule as a graph, where atoms are nodes containing a one-hot encoding of their atomic identity (i.e., which kind of atom: Carbon, Nitrogen, Oxygen, Fluorine) and bonds are edges containing a one-hot encoding of their bond type (single, double, triple or aromatic).
Our general modeling template for this problem will be built up using sequential GNN layers, followed by a linear model with a sigmoid activation for classification. The design space for our GNN has many levers that can customize the model:
- The number of GNN layers, also called the depth.
- The dimensionality of each attribute when updated. The update function is a 1-layer MLP with a relu activation function and a layer norm for normalization of activations.
- The aggregation function used in pooling: max, mean or sum.
- The graph attributes that get updated, or styles of message passing: nodes, edges and global representation. We control these via boolean toggles (on or off). A baseline model would be a graph-independent GNN (all message-passing off) which aggregates all data at the end into a single global attribute. Toggling on all message-passing functions yields a GraphNets architecture.
To better understand how a GNN is learning a task-optimized representation of a graph, we also look at the penultimate layer activations of the GNN. These ‘graph embeddings’ are the outputs of the GNN model right before prediction. Since we are using a generalized linear model for prediction, a linear mapping is enough to allow us to see how we are learning representations around the decision boundary.
Since these are high dimensional vectors, we reduce them to 2D via principal component analysis (PCA). A perfect model would visibly separate the labeled data, but since we are reducing dimensionality and also have imperfect models, this boundary might be harder to see.
https://distill.pub/2021/gnn-intro/#gnn-playground
Play around with different model architectures to build your intuition. For example, see if you can edit the molecule on the left to make the model prediction increase. Do the same edits have the same effects for different model architectures?
Some empirical GNN design lessons
When exploring the architecture choices above, you might have found some models have better performance than others. Are there some clear GNN design choices that will give us better performance? For example, do deeper GNN models perform better than shallower ones? Or is there a clear choice between aggregation functions? The answers are going to depend on the data [25][26], and even different ways of featurizing and constructing graphs can give different answers.
With the following interactive figure, we explore the space of GNN architectures and the performance of this task across a few major design choices: Style of message passing, the dimensionality of embeddings, number of layers, and aggregation operation type.
the number of trainable variables
Each point in the scatter plot represents a model: the x axis is the number of trainable variables, and the y axis is the performance. Hover over a point to see the GNN architecture parameters.
The first thing to notice is that, surprisingly, a higher number of parameters does correlate with higher performance. GNNs are a very parameter-efficient model type: for even a small number of parameters (3k) we can already find models with high performance.
the dimensionality of embeddings
Next, we can look at the distributions of performance aggregated based on the dimensionality of the learned representations for different graph attributes.
Aggregate performance of models across varying node, edge, and global dimensions.
We can notice that models with higher dimensionality tend to have better mean and lower-bound performance, but the same trend is not found for the maximum. Some of the top-performing models can be found at smaller dimensions. Since higher dimensionality also involves a higher number of parameters, these observations go hand in hand with the previous figure.
the number of GNN layers
Next we can see the breakdown of performance based on the number of GNN layers.
Chart of number of layers vs model performance, and scatterplot of model performance vs number of parameters. Each point is colored by the number of layers. Hover over a point to see the GNN architecture parameters.
The box plot shows a similar trend: while the mean performance tends to increase with the number of layers, the best performing models do not have three or four layers, but two. Furthermore, the lower bound for performance decreases with four layers. This effect has been observed before: GNNs with a higher number of layers will broadcast information over a greater distance and can risk having their node representations 'diluted' by many successive iterations [27].
aggregation operation type
Does our dataset have a preferred aggregation operation? Our following figure breaks down performance in terms of aggregation type.
Chart of aggregation type vs model performance, and scatterplot of model performance vs number of parameters. Each point is colored by aggregation type. Hover over a point to see the GNN architecture parameters.
Overall it appears that sum has a very slight improvement on the mean performance, but max or mean can give equally good models. This is useful to contextualize when looking at the discriminatory/expressive capabilities of aggregation operations.
The previous explorations have given mixed messages. We can find mean trends where more complexity gives better performance but we can find clear counterexamples where models with fewer parameters, number of layers, or dimensionality perform better. One trend that is much clearer is about the number of attributes that are passing information to each other (this is what the next figure shows).
the style of message passing
Here we break down performance based on the style of message passing. On both extremes, we consider models that do not communicate between graph entities (“none”) and models that have messages passed between nodes, edges, and globals.
Chart of message passing vs model performance, and scatterplot of model performance vs number of parameters. Each point is colored by message passing. Hover over a point to see the GNN architecture parameters
Overall we see that the more graph attributes are communicating, the better the performance of the average model. Our task is centered on global representations, so explicitly learning this attribute also tends to improve performance. Our node representations also seem to be more useful than edge representations, which makes sense since more information is loaded in these attributes.
There are many directions you could go from here to get better performance. We wish to highlight two general directions, one related to more sophisticated graph algorithms and another towards the graph itself.
Up until now, our GNN is based on a neighborhood-based pooling operation. There are some graph concepts that are harder to express in this way, for example a linear graph path (a connected chain of nodes). Designing new mechanisms in which graph information (the topological information of the whole graph) can be extracted, executed and propagated in a GNN is one current research area [28], [29], [30], [31].
One of the frontiers of GNN research is not making new models and architectures, but “how to construct graphs”, or more precisely, imbuing graphs with additional structure or relations that can be leveraged. As we loosely saw, the more the graph attributes communicate, the better the models we tend to get. In this particular case, we could consider making molecular graphs more feature-rich, by adding additional spatial relationships between nodes, adding edges that are not bonds, or explicit learnable relationships between subgraphs.
Into the Weeds
Next, we have a few sections on a myriad of graph-related topics that are relevant for GNNs.
Other types of graphs (multigraphs, hypergraphs, hypernodes, hierarchical graphs)
While we only described graphs with vectorized information for each attribute, graph structures are more flexible and can accommodate other types of information. Fortunately, the message passing framework is flexible enough that often adapting GNNs to more complex graph structures is about defining how information is passed and updated by new graph attributes.
For example, we can consider multi-edge graphs or multigraphs[32], where a pair of nodes can share multiple types of edges. This happens when we want to model the interactions between nodes differently based on their type. For example, with a social network we can specify edge types based on the type of relationship (acquaintance, friend, family).
A GNN can be adapted by having different types of message passing steps for each edge type.
We can also consider nested graphs, where for example a node represents a graph, also called a hypernode graph.[33]
Nested graphs are useful for representing hierarchical information. For example, we can consider a network of molecules, where a node represents a molecule and an edge is shared between two molecules if we have a way (reaction) of transforming one to the other [34][35].
In this case, we can learn on a nested graph by having a GNN that learns representations at the molecule level and another at the reaction network level, and alternate between them during training (this means, for example, that a node embedding learned in the reaction-network-level GNN serves as the global embedding in the molecule-level GNN).
Sampling Graphs and Batching in GNNs
A common practice for training neural networks is to update network parameters with gradients calculated on randomized constant size (batch size) subsets of the training data (mini-batches).
This practice presents a challenge for graphs due to the variability in the number of nodes and edges adjacent to each other, meaning that we cannot have a constant batch size. The main idea for batching with graphs is to create subgraphs that preserve essential properties of the larger graph. This graph sampling operation is highly dependent on context and involves sub-selecting nodes and edges from a graph (sub-selecting means taking subsets of the full graph's node set and edge set). These operations might make sense in some contexts (citation networks) and in others, these might be too strong of an operation (molecules, where a subgraph simply represents a new, smaller molecule).
How to sample a graph is an open research question.[39] If we care about preserving structure at a neighborhood level, one way would be to randomly sample a uniform number of nodes, our node-set. Then add neighboring nodes of distance k adjacent to the node-set, including their edges.[40] Each neighborhood can be considered an individual graph and a GNN can be trained on batches of these subgraphs. The loss can be masked to only consider the node-set, since all neighboring nodes would have incomplete neighborhoods. A more efficient strategy might be to first randomly sample a single node, expand its neighborhood to distance k, and then pick the other node within the expanded set. These operations can be terminated once a certain number of nodes, edges, or subgraphs are constructed. If the context allows, we can build constant-size neighborhoods by picking an initial node-set and then sub-sampling a constant number of nodes (e.g., randomly, or via a random walk or Metropolis algorithm[41]).
Four different ways of sampling the same graph. The choice of sampling strategy depends highly on context since each will generate different distributions of graph statistics (# nodes, # edges, etc.). For highly connected graphs, edges can also be subsampled.
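A minimal sketch of one of the strategies above (sample a seed node-set, expand it by k hops, and keep a mask so the loss only covers the seeds with complete neighborhoods); this is my own illustration of the idea, not a specific library's API:

```python
import numpy as np

adjacency_list = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 0], [2, 5]])
n_nodes = 6

def sample_neighborhood(adjacency_list, n_nodes, n_seeds, k, rng):
    """Sample seed nodes, expand to their k-hop neighborhood, return subgraph + loss mask."""
    seeds = set(rng.choice(n_nodes, size=n_seeds, replace=False).tolist())
    kept = set(seeds)
    for _ in range(k):                        # expand the frontier k times
        frontier = set()
        for i, j in adjacency_list:
            if i in kept: frontier.add(j)
            if j in kept: frontier.add(i)
        kept |= frontier
    sub_edges = [(i, j) for i, j in adjacency_list if i in kept and j in kept]
    loss_mask = {n: (n in seeds) for n in kept}   # only seeds have full neighborhoods
    return sorted(kept), sub_edges, loss_mask

rng = np.random.default_rng(0)
nodes_kept, sub_edges, loss_mask = sample_neighborhood(adjacency_list, n_nodes, 2, 1, rng)
```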
Sampling a graph is particularly relevant when a graph is large enough that it cannot fit in memory, inspiring new architectures and training strategies such as Cluster-GCN [42] and GraphSaint [43]. We expect graph datasets to continue growing in size in the future.
Inductive biases
When building a model to solve a problem on a specific kind of data, we want to specialize our models to leverage the characteristics of that data. When this is done successfully, we often see better predictive performance, lower training time, fewer parameters and better generalization.
When labeling on images, for example, we want to take advantage of the fact that a dog is still a dog whether it is in the top-left or bottom-right corner of an image. Thus, most image models use convolutions, which are translation invariant. For text, the order of the tokens is highly important, so recurrent neural networks process data sequentially. Further, the presence of one token (e.g. the word ‘not’) can affect the meaning of the rest of a sentence, and so we need components that can ‘attend’ to other parts of the text, which transformer models like BERT and GPT-3 can do. These are some examples of inductive biases, where we are identifying symmetries or regularities in the data and adding modelling components that take advantage of these properties.
In the case of graphs, we care about how each graph component (edge, node, global) is related to each other, so we seek models that have a relational inductive bias.[19] A model should preserve explicit relationships between entities (adjacency matrix) and preserve graph symmetries (permutation invariance). We expect problems where the interaction between entities is important to benefit from a graph structure. Concretely, this means designing transformations on sets: the order of operations on nodes or edges should not matter and the operation should work on a variable number of inputs (I did not fully understand this sentence).
Comparing aggregation operations
Pooling information from neighboring nodes and edges is a critical step in any reasonably powerful GNN architecture. Because each node has a variable number of neighbors, and because we want a differentiable method of aggregating this information, we want to use a smooth aggregation operation that is invariant to node ordering and the number of nodes provided.
Selecting and designing optimal aggregation operations is an open research topic.[44] A desirable property of an aggregation operation is that similar inputs provide similar aggregated outputs (meaning that, for such a function, if input 1 and input 2 are similar, then output 1 and output 2 should also be similar), and vice-versa. Some very simple candidate permutation-invariant operations are sum, mean, and max. Summary statistics like variance also work. All of these take a variable number of inputs, and provide an output that is the same, no matter the input ordering.
There is no operation that is uniformly the best choice. The mean operation can be useful when nodes have a highly-variable number of neighbors or you need a normalized view of the features of a local neighborhood. The max operation can be useful when you want to highlight single salient features in local neighborhoods. Sum provides a balance between these two, by providing a snapshot of the local distribution of features, but because it is not normalized, can also highlight outliers. In practice, sum is commonly used.
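A quick illustration of these permutation-invariant aggregations over a variable-size neighborhood (a tiny numpy sketch of my own):

```python
import numpy as np

# Embeddings of one node's neighbors; the count varies from node to node.
neighbors = np.array([[1.0, 0.0],
                      [2.0, 3.0],
                      [0.5, 1.5]])

agg_sum  = neighbors.sum(axis=0)    # keeps magnitude, sensitive to neighborhood size
agg_mean = neighbors.mean(axis=0)   # normalized view of the local neighborhood
agg_max  = neighbors.max(axis=0)    # highlights the single most salient feature

# All three give the same answer under any re-ordering of the rows (permutation invariance).
perm = np.random.permutation(len(neighbors))
assert np.allclose(agg_sum, neighbors[perm].sum(axis=0))
```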
Designing aggregation operations is an open research problem that intersects with machine learning on sets.[45] New approaches such as Principal Neighborhood aggregation[27] take into account several aggregation operations by concatenating them and adding a scaling function that depends on the degree of connectivity of the entity to aggregate. Meanwhile, domain specific aggregation operations can also be designed. One example lies with the “Tetrahedral Chirality” aggregation operators [46].
GCN as subgraph function approximators
Another way to see a GCN (and MPNN) of k layers with a 1-degree-neighbor lookup (meaning that during message passing, the pooling range for a node is its 1-hop neighbors) is as a neural network (g_subgraphs in the figure below) that operates on learned embeddings of subgraphs of size k[47][44]. (In terms of the GNN framework explained earlier, g_subgraphs in the figure corresponds to the prediction stage of the downstream graph-level task, where the already-learned embeddings are used to make predictions.)
When focusing on one node, after k-layers, the updated node representation has a limited viewpoint of all neighbors up to k-distance, essentially a subgraph representation. Same is true for edge representations.
So a GCN is collecting all possible subgraphs of size k and learning vector representations from the vantage point of one node or edge. The number of possible subgraphs can grow combinatorially, so enumerating these subgraphs from the beginning vs building them dynamically as in a GCN, might be prohibitive.
Edges and the Graph Dual
One thing to note is that edge predictions and node predictions, while seemingly different, often reduce to the same problem: an edge prediction task on a graph G can be phrased as a node-level prediction on G’s dual.
To obtain G’s dual, we can convert nodes to edges (and edges to nodes). A graph and its dual contain the same information, just expressed in a different way. Sometimes this property makes solving problems easier in one representation than another, like frequencies in Fourier space.
In short, to solve an edge classification problem on G, we can think about doing graph convolutions on G’s dual (which is the same as learning edge representations on G) (I did not fully understand this sentence); this idea was developed in Dual-Primal Graph Convolutional Networks.[48]
Graph convolutions as matrix multiplications, and matrix multiplications as walks on a graph
We’ve talked a lot about graph convolutions and message passing, and of course, this raises the question of how do we implement these operations in practice? For this section, we explore some of the properties of matrix multiplication, message passing, and its connection to traversing a graph.
Personally, I feel graph convolutions can be understood as message passing.
Recall convolutional neural networks in the image domain: we first apply convolution to the image, computing a convolution for each pixel (i.e., aggregating the pixels around it) to obtain a vector representation per pixel, and then feed this into the neural network.
Graph convolution is similar: before feeding a vector into the neural network, an extra step uses the structural information of the graph to “enrich” the current vector (e.g., a node vector, edge vector, or global vector) via message passing.
The first point we want to illustrate is that the matrix multiplication of an adjacency matrix A of size n_nodes × n_nodes with a node feature matrix X of size n_nodes × node_dim implements a simple message passing with a summation aggregation (for each dimension of the node features, it sums that dimension over the vectors of a node's 1-hop neighbors).
Let B = AX. We can observe that any entry B_{i,j} can be expressed as ⟨A_row_i, X_col_j⟩ = A_{i,1} X_{1,j} + A_{i,2} X_{2,j} + … + A_{i,n} X_{n,j} = Σ_k A_{i,k} X_{k,j}. Because A_{i,k} is a binary entry that is nonzero only when an edge exists between node_i and node_k, the inner product is essentially “gathering” all node feature values of dimension j that share an edge with node_i. It should be noted that this message passing does not update the representation of the node features; it only pools neighboring node features. But this can easily be adapted by passing X through your favorite differentiable transformation (e.g. an MLP) before or after the matrix multiply.
From this view, we can appreciate the benefit of using adjacency lists. Due to the expected sparsity of A we don’t have to sum all values where A_{i,j} is zero. As long as we have an operation to gather values based on an index, we should be able to just retrieve positive entries. Additionally, this matrix-multiply-free approach frees us from using summation as the aggregation operation.
We can imagine that applying this operation multiple times allows us to propagate information over greater distances. In this sense, matrix multiplication is a form of traversing over a graph. This relationship is also apparent when we look at powers A^k of the adjacency matrix. If we consider the matrix A^2, the term (A^2)_{i,j} counts all walks of length 2 from node_i to node_j and can be expressed as the inner product ⟨A_row_i, A_col_j⟩ = A_{i,1} A_{1,j} + A_{i,2} A_{2,j} + … + A_{i,n} A_{n,j}. The intuition is that the first term A_{i,1} A_{1,j} is positive only under two conditions: there is an edge that connects node_i to node_1 and another edge that connects node_1 to node_j. In other words, both edges form a path of length 2 that goes from node_i to node_j passing through node_1. Due to the summation, we are counting over all possible intermediate nodes. This intuition carries over when we consider A^3 = A·A^2, and so on up to A^k.
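A small numpy check of both claims (sum-aggregation via A @ X, and walk counting via powers of A), using an arbitrary toy graph of my own:

```python
import numpy as np

# Toy undirected graph with edges (0,1), (1,2), (2,0), (2,3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
X = np.random.randn(4, 5)          # [n_nodes, node_dim] node features

# B[i] is the sum of the features of node i's neighbors (message passing with sum).
B = A @ X
assert np.allclose(B[0], X[1] + X[2])

# (A @ A)[i, j] counts walks of length 2 from node i to node j.
A2 = A @ A
assert A2[0, 3] == 1               # the only length-2 walk from 0 to 3 is 0 -> 2 -> 3
```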
There are deeper connections on how we can view matrices as graphs to explore [49][50][51].
Graph Attention Networks
Another way of communicating information between graph attributes is via attention.[52] For example, when we consider the sum-aggregation of a node and its 1-degree neighboring nodes (1-hop neighbors), we could also consider using a weighted sum.
Here I feel this is more like something from the domain of aggregation-function design.
The challenge then is to associate weights in a permutation invariant fashion. One approach is to consider a scalar scoring function that assigns weights based on pairs of nodes (f(nodei,nodej)). In this case, the scoring function can be interpreted as a function that measures how relevant a neighboring node is in relation to the center node. Weights can be normalized, for example with a softmax function to focus most of the weight on a neighbor most relevant for a node in relation to a task. This concept is the basis of Graph Attention Networks (GAT) [53] and Set Transformers[54]. Permutation invariance is preserved, because scoring works on pairs of nodes. A common scoring function is the inner product and nodes are often transformed before scoring into query and key vectors via a linear map to increase the expressivity of the scoring mechanism. Additionally for interpretability, the scoring weights can be used as a measure of the importance of an edge in relation to a task.
Schematic of attention over one node with respect to its adjacent nodes. For each edge an interaction score is computed, normalized and used to weight node embeddings.
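A minimal sketch of this pairwise scoring with query/key linear maps and a softmax over the neighbors (my own simplified version of the GAT-style idea; names and shapes are illustrative):

```python
import numpy as np

node_dim = 8
nodes = np.random.randn(5, node_dim)
center, neighbors = 0, [1, 2, 3]          # node 0 attends over its neighbors

W_q = np.random.randn(node_dim, node_dim) * 0.1   # query map
W_k = np.random.randn(node_dim, node_dim) * 0.1   # key map

q = nodes[center] @ W_q                   # query for the center node
keys = nodes[neighbors] @ W_k             # keys for its neighbors

scores = keys @ q                         # inner-product scoring f(node_i, node_j)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax-normalized attention weights

# Weighted sum of neighbor embeddings replaces the plain sum aggregation.
attended = weights @ nodes[neighbors]
```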
I did not fully understand this part:
Additionally, transformers can be viewed as GNNs with an attention mechanism [55]. Under this view, the transformer models several elements (e.g. character tokens) as nodes in a fully connected graph, and the attention mechanism assigns edge embeddings to each node pair, which are used to compute attention weights. The difference lies in the assumed pattern of connectivity between entities: a GNN assumes a sparse pattern, while the Transformer models all connections.
My understanding of the GNN process
A k-layer GNN:
For a node v, initialize an embedding for v and for each of its neighbors within k hops.
Then:
Layer 1: these nodes perform message passing (gather-aggregate), and then each node is updated to obtain a new embedding.
Layer 2: message passing, update.
…
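A minimal sketch of this k-layer loop (reusing the simple sum-aggregation update from the earlier sketch; after k layers each node has incorporated information from up to k hops away):

```python
import numpy as np

nodes = np.random.randn(6, 8)                                   # initial embeddings
adjacency_list = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
k = 3
weights = [np.random.randn(16, 8) * 0.1 for _ in range(k)]      # one update per layer

for layer in range(k):
    messages = np.zeros_like(nodes)
    for i, j in adjacency_list:             # gather + aggregate (sum) over neighbors
        messages[i] += nodes[j]
        messages[j] += nodes[i]
    # update: each node combines its own embedding with its pooled messages
    nodes = np.maximum(0, np.concatenate([nodes, messages], axis=1) @ weights[layer])
```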