CS224W Machine Learning with Graphs
Lecture 8
Title: Graph Neural Networks
Date: 2020. 04. 17 (FRI) ~ 2020. 04. 20 (MON)
Materials: Slides YouTube
Additional Materials: Pooling, GNN
Shallow Encoders (Node Embedding)
- Map nodes in original graph to \(d\)-dimensional embedding space
- \[ENC(v) = z_v\]
- Goal: the dot product of two nodes' embedding vectors should approximate the similarity of the two nodes in the original graph
- \[similarity(u, v) \approx z_v^\intercal z_u\]
- Limitations
- Every node has its own unique embedding, so parameters are not shared and grow with \(|V|\) (see the lookup-table sketch after this list)
- Cannot generate embeddings for nodes that are not seen during training
- Does not incorporate node features
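A minimal sketch of what a shallow encoder amounts to, assuming nodes are indexed \(0, ..., |V|-1\) (variable names are illustrative): the encoder is just an embedding lookup table, which is why the parameter count grows with \(|V|\) and unseen nodes cannot be embedded.

```python
import numpy as np

num_nodes, d = 5, 8                   # |V| nodes, d-dimensional embeddings
rng = np.random.default_rng(0)

# Shallow encoder: one free d-dimensional vector per node (a lookup table),
# i.e. O(|V| * d) parameters with no sharing and no use of node features.
Z = rng.normal(size=(num_nodes, d))

def encode(v):
    """ENC(v) = z_v is a plain row lookup."""
    return Z[v]

# Dot-product score that training would push toward similarity(u, v).
u, v = 0, 3
print(encode(u) @ encode(v))
```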
Deep Graph Encoders
- Use deep neural networks as the encoder, combined with a node similarity function
- The challenge: modern ML/DL toolboxes are specialized for simple data types (sequences or grids)
- A graph has arbitrary size and no fixed node ordering or reference point
- Naive Approach
- Concatenate the adjacency matrix and node features and feed the result into a (fully-connected) neural network
- Issues
- Too many parameters: the input size grows with the number of nodes plus the number of features
- Not applicable to graphs of different sizes
- Not invariant to node ordering (illustrated in the sketch below)
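To make the node-ordering issue concrete, here is a small hedged sketch: the same graph under two different node labelings generally produces different outputs when the flattened adjacency matrix is fed to a fixed linear layer (the weights are random placeholders, not a trained model).

```python
import numpy as np

rng = np.random.default_rng(0)

# The same 3-node path graph under two different node orderings.
A1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]], dtype=float)
perm = np.array([2, 0, 1])
A2 = A1[perm][:, perm]               # identical graph, relabeled nodes

W = rng.normal(size=(9,))            # one linear layer over the flattened matrix

print(A1.flatten() @ W)              # these two outputs generally differ:
print(A2.flatten() @ W)              # the model is not invariant to node ordering
```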
Basics of Deep Learning for Graphs
GCN (Graph Convolutional Networks)
- Key concepts
- Understanding local network neighborhoods
- Stacking multiple layers
- Graph \(G\)
- \(V\): vertex set
- \(A\): adjacency matrix
- \(X \in \mathbb{R}^{m \times |V|}\): matrix of node features
- Aggregate Neighbors
- Generate embeddings based on local network neighborhoods
- Using neural networks, nodes aggregate information from their neighbors (within a certain number of hops)
- Every node defines its own computation graph based on its neighborhood, so each node is effectively computed by a different neural network
- Aggregation
- Basic approach: average the information from the neighbors, then apply a neural network
- The Math: Deep Encoder
- \(h_v^0 = x_v\): the 0-th layer embedding of node \(v\) is its feature vector
- \(z_v = h_v^K\): the final embedding is the output of the last layer (\(K\))
- \(h_v^k\): the hidden embedding of node \(v\) at layer \(k\) (a NumPy sketch of this update follows the list)
\(h_v^k = \sigma \left( W_k \sum_{u \in N(v)} { h_u^{k-1} \over |N(v)| } + B_k h_v^{k-1} \right), \forall k \in \{1,...,K\}\)
- \(\sigma\) is a non-linearity, e.g., ReLU
- \(W_k, B_k\) are trainable parameter matrices
- These two parameters trade off against each other
- If \(W_k\) dominates, node \(v\)'s embedding is determined more by its neighbors
- If \(B_k\) dominates, node \(v\)'s embedding is determined more by its own features
- \(N(v)\) is the set of neighbors of node \(v\)
- \(\sum_{u \in N(v)} { h_u^{k-1} \over |N(v)| }\) is the average of the neighbors' embeddings from the previous layer (\(k-1\))
- \(h_v^{k-1}\) is node \(v\)'s own embedding (message) from the previous layer
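A minimal NumPy sketch of the update rule above (the toy graph, dimensions, and variable names are illustrative, and the per-layer weights \(W_k, B_k\) are random here rather than trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph as an adjacency list, plus node features x_v.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
X = rng.normal(size=(4, 3))                    # h_v^0 = x_v (4 nodes, 3 features)

K = 2                                          # number of layers
dims = [3, 4, 2]                               # layer widths
Ws = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(K)]  # W_k
Bs = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(K)]  # B_k
relu = lambda z: np.maximum(z, 0)              # sigma

def gcn_layer(H, W, B):
    """h_v^k = sigma(W_k * average of neighbor h_u^{k-1} + B_k * h_v^{k-1})."""
    H_next = np.zeros((H.shape[0], W.shape[0]))
    for v, nbrs in neighbors.items():
        nbr_mean = H[nbrs].mean(axis=0)        # sum over N(v) divided by |N(v)|
        H_next[v] = relu(W @ nbr_mean + B @ H[v])
    return H_next

H = X
for k in range(K):
    H = gcn_layer(H, Ws[k], Bs[k])
Z = H                                          # z_v = h_v^K
print(Z.shape)                                 # (4, 2)
```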
Unsupervised Training
- Node similarity can come from random walks, graph factorization, or node proximity in the graph; train so that similar nodes get similar embeddings (a sketch of such an objective follows)
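As one concrete (hedged) example of such an objective, not necessarily the slide's exact notation: a cross-entropy over node pairs, where \(y_{u,v} = 1\) if \(u\) and \(v\) are deemed similar (e.g., they co-occur on a random walk) and the dot product acts as the decoder:
\[\mathcal{L} = \sum_{(u,v)} CE\big(y_{u,v},\; \sigma(z_u^\intercal z_v)\big)\]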
Supervised Training
- Train directly for a supervised task (e.g., node classification) by minimizing a loss function \(\mathcal{L}\) (an example loss follows)
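For example, for a binary node classification task, the loss could be a cross-entropy over the final embeddings, with \(\theta\) a classification weight vector (a sketch, not necessarily the slide's exact notation):
\[\mathcal{L} = -\sum_{v \in V} \Big( y_v \log(\sigma(z_v^\intercal \theta)) + (1 - y_v)\log(1 - \sigma(z_v^\intercal \theta)) \Big)\]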
Inductive Capability
- The parameters \(W_k, B_k\) are shared across all nodes, so they can be applied to unseen nodes and even to new graphs
- For a huge graph, we can train on a snapshot and apply the same parameters to other parts of the graph or to newly added nodes
Graph Convolutional Networks and GraphSAGE
- Apply a generalized aggregation function instead of the simple average, and concatenate the aggregated neighborhood message with the node's own embedding
- \[h_v^k = \sigma ([W_k \cdot AGG(\{h_u^{k-1}, \forall u \in N(v)\}), B_k h_v^{k-1}])\]
- Variants of \(AGG(\{h_u^{k-1}, \forall u \in N(v)\})\) (a NumPy sketch of the mean and pool variants follows)
- Mean: \(AGG = \sum_{u\in N(v)}{ h_u^{k-1} \over |N(v)| }\)
- Pool: transform neighbor vectors with \(Q\) and apply a symmetric pooling function \(\gamma\), e.g., mean or max pooling: \(AGG = \gamma(\{ Qh_u^{k-1}, \forall u \in N(v) \})\)
- LSTM: apply an LSTM to a random permutation \(\pi\) of the neighbors: \(AGG = LSTM([h_u^{k-1}, \forall u \in \pi(N(v))])\)
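A hedged NumPy sketch of the mean and pool aggregators for a single node (\(Q\), the choice of \(\gamma\) as element-wise max, and all dimensions are placeholders; the LSTM variant would need a recurrent network and is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
H_prev = rng.normal(size=(5, 4))       # h_u^{k-1} for 5 nodes, 4 dims
nbrs = [1, 2, 4]                       # N(v) for some node v
Q = rng.normal(size=(4, 4))            # pooling transform Q (placeholder)

# Mean aggregator: average the previous-layer embeddings of N(v).
agg_mean = H_prev[nbrs].mean(axis=0)

# Pool aggregator: transform each neighbor with Q, then apply a symmetric
# function gamma -- element-wise max here, i.e. max pooling.
agg_pool = np.max(H_prev[nbrs] @ Q.T, axis=0)

print(agg_mean.shape, agg_pool.shape)  # both (4,)
```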
- Efficient Implementation
- Using sparse matrix operations
- \(\sum_{u \in N(v)} { h_u^{k-1} \over |N(v)| }\) can be calculated as \(H^{k} = D^{-1}AH^{k-1}\), where \(H^{k} = [h_1^{k}, ..., h_{|V|}^{k}]^\intercal\), \(D\) is the diagonal degree matrix, and \(A\) is the adjacency matrix
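A small SciPy sketch of this sparse formulation (the toy graph is illustrative; isolated nodes would need special handling since \(D^{-1}\) divides by the degree):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Toy 4-node graph and previous-layer embeddings H^{k-1}.
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 1],
                            [1, 1, 0, 0],
                            [0, 1, 0, 0]], dtype=float))
H_prev = rng.normal(size=(4, 3))

deg = np.asarray(A.sum(axis=1)).ravel()   # node degrees |N(v)|
D_inv = sp.diags(1.0 / deg)               # D^{-1}

# Row v of D^{-1} A H^{k-1} equals sum_{u in N(v)} h_u^{k-1} / |N(v)|,
# i.e. all neighborhood averages computed at once with sparse ops.
H_agg = D_inv @ A @ H_prev
print(H_agg.shape)                        # (4, 3)
```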
Graph Attention Networks (GAT)
- Simple neighborhood aggregation gives every neighbor the same weight (importance)
- \(\alpha_{vu} = 1 / |N(v)|\) is the weighting factor of node \(u\)'s message to node \(v\) in the average aggregation
- Attention mechanism \(a\): compute attention coefficients \(e_{vu}\) indicating the importance of node \(u\)'s message to node \(v\)
- \[e_{vu} = a(W_kh_u^{k-1}, W_kh_v^{k-1})\]
- Normalize coefficients using the softmax function
- \(\alpha_{vu} = { \exp(e_{vu}) \over \sum_{k \in N(v)} \exp(e_{vk}) }\)
- \(h_v^k = \sigma(\sum_{u \in N(v)} \alpha_{vu} W_kh_u^{k-1})\)
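A hedged NumPy sketch of the attention computation for a single node \(v\). The lecture leaves \(a(\cdot,\cdot)\) unspecified; here it is taken to be a single linear layer on the concatenation of the two transformed messages, which is one common parameterization rather than the only option:

```python
import numpy as np

rng = np.random.default_rng(0)
H_prev = rng.normal(size=(5, 4))          # h_u^{k-1} for 5 nodes, 4 dims
W = rng.normal(size=(4, 4))               # W_k
a = rng.normal(size=(8,))                 # parameters of a(.,.) (placeholder)
nbrs = np.array([1, 2, 4])                # N(v)
v = 0
relu = lambda z: np.maximum(z, 0)         # sigma

Wh = H_prev @ W.T                         # W_k h^{k-1} for every node

# e_vu = a(W_k h_u^{k-1}, W_k h_v^{k-1}): here a(.,.) scores the concatenation.
e = np.array([a @ np.concatenate([Wh[u], Wh[v]]) for u in nbrs])

# alpha_vu: softmax of e_vu over the neighborhood N(v).
alpha = np.exp(e - e.max())
alpha = alpha / alpha.sum()

# h_v^k = sigma(sum_u alpha_vu W_k h_u^{k-1})
h_v = relu((alpha[:, None] * Wh[nbrs]).sum(axis=0))
print(alpha, h_v.shape)
```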