CS224W Machine Learning with Graphs
Lecture 7
Title: Graph Representation Learning
Date: 2020. 04. 14 (TUE) ~ 2020. 04. 16 (THU)
Materials: Slides, YouTube
Additional Materials: Negative Sampling (처음의 마음), Graph Embedding
Introduction
- Machine Learning in Networks
- Node classification
- Link prediction
- Machine Learning Lifecycle
- Raw Data -> Structured Data -> Learning Algorithm -> Model
- “Feature Engineering” is the process between Raw Data and Structured Data
- Task: how to learn the features automatically
- Feature Learning in Graphs
- Embedding: a learned feature representation for the given nodes
- Map each node in a network into a “low-dimensional space”
Embedding Nodes
- Encode nodes into an embedding space with an encoder \(ENC(u)\) so that similarity in the embedding space reflects similarity in the network
- Goal: \(similarity(u, v) \approx {z_u}^\intercal{z_v}\)
- Task
- Define an encoder
- Define a node similarity function
- Optimize the parameters of the encoder
Encoder
- \(ENC(v) = z_v\) (\(d\)-dimensional embedding)
- Shallow Encoding
- \[ENC(v) = Zv\]
- \(Z \in \mathbb{R}^{d\times|\mathcal{V}|}\), \(v \in \mathbb{I}^{|\mathcal{V}|}\) (one-hot indicator vector of the node)
- \(Z\) is the embedding matrix, the main object we want to learn (see the sketch below)
- Available methods: DeepWalk, node2vec, TransE
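A minimal sketch of shallow encoding as an embedding-matrix lookup, together with the dot-product similarity used in the next section; the toy sizes (`num_nodes`, `d`) and the random initialization are illustrative assumptions, not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes = 5   # |V| of a toy graph (assumption)
d = 3           # embedding dimension (assumption)

# Z is the d x |V| embedding matrix we want to learn.
Z = rng.normal(size=(d, num_nodes))

def encode(v):
    """Shallow encoder: ENC(v) = Z v, a column lookup
    where v is the one-hot indicator vector of the node."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot          # equivalent to Z[:, v]

# Similarity is the dot product between node embeddings.
z_u, z_v = encode(0), encode(1)
print(float(z_u @ z_v))
```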
Similarity function
- \(similarity(u, v) \approx {z_v}^\intercal{z_u}\) (dot product between node embeddings)
Random Walk Approaches to Node Embeddings
- A random walk on the graph is a random sequence of nodes
- typically about 5 steps long (see the walk sketch at the end of this section)
- \({z_u}^\intercal{z_v} \approx\) the probability that \(u\) and \(v\) co-occur on a random walk over the graph
- \(P_R(v|u)\): the probability of reaching \(v\) on a random walk from \(u\) using strategy \(R\)
- Similarity (\(cos(\theta)\)) \(\varpropto P_R(v|u)\)
- Why use random walks?
- Expressivity
- Efficiency: only need to consider pairs that co-occur on random walks
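A minimal sketch of an unbiased fixed-length random walk over an adjacency-list graph; the toy `graph` and the walk length of 5 are illustrative assumptions:

```python
import random

# toy undirected graph as an adjacency list (assumption for illustration)
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}

def random_walk(graph, start, length=5):
    """Take 'length' steps, choosing the next node uniformly among neighbors."""
    walk = [start]
    for _ in range(length):
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

print(random_walk(graph, start=0))   # e.g. [0, 2, 1, 3, 4, 3]
```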
Unsupervised Feature Learning (DeepWalk)
- \(N_R(u)\): the neighborhood of \(u\) obtained by some strategy \(R\)
- i.e., the nodes that appear on the same walks generated by that strategy
- Optimization
- Log-likelihood objective: \(\max_z \sum_{ {u} \in {V}} \log P(N_R(u) | z_u)\)
- Random Walk Optimization
- Run short fixed-length random walks starting from each node
- For each node \(u\) collect \(N_R(u)\)
- Optimize embeddings
- Loss function (\(\mathcal{L}\)) \(= \sum_{ {u}\in{V}}\sum_{ {v}\in{N_R(u)}} - \log(P(v | z_u))\)
- Parameterize \(P(v | z_u)\) using softmax
- \[P(v | z_u) = { {exp(z_u^\intercal z_v)} \over {\sum_{ {n}\in{V}}exp(z_u^\intercal z_n)}}\]
- Optimizing means finding embeddings \(z_u\) that minimize \(\mathcal{L}\) (using Stochastic Gradient Descent)
- Negative Sampling
- The concept is used in Word2Vec
- Instead of normalizing over all nodes, normalize against only a few randomly sampled nodes (in Word2Vec, the observed pair plus a handful of unrelated negative samples); see the sketch below
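A minimal sketch of a Word2Vec-style negative-sampling loss for a single co-occurring pair \((u, v)\); the sigmoid form, the number of negatives `k`, and drawing negatives uniformly are simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 10, 4
Z = rng.normal(size=(num_nodes, d))   # one embedding row per node (placeholder)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u, v, k=3):
    """-log sigma(z_u . z_v) - sum over k sampled negatives of log sigma(-z_u . z_n);
    approximates the full softmax normalization with only k nodes."""
    loss = -np.log(sigmoid(Z[u] @ Z[v]))
    negatives = rng.choice(num_nodes, size=k, replace=False)
    for n in negatives:
        loss -= np.log(sigmoid(-(Z[u] @ Z[n])))
    return float(loss)

print(neg_sampling_loss(u=0, v=1))
```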
Node2Vec
- Capture local (BFS, microscopic view) and global (DFS, macroscopic view) views of the network
- Interpolate between BFS and DFS
- \(p\): the return parameter; controls the probability of walking back to the previous node
- \(q\): the in-out parameter; the ratio of BFS-like exploration (staying close to the previous node) vs. DFS-like exploration (moving outward); see the biased-walk sketch after this section
- Node2Vec algorithm
- Compute random walk probabilities
- Simulate \(r\) random walks of length \(l\) starting from each node \(u\)
- Optimize the node2vec objective using Stochastic Gradient Descent
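A minimal sketch of node2vec's second-order biased step with return parameter \(p\) and in-out parameter \(q\); the toy `graph` is an assumption, and the unnormalized weights are computed on the fly here rather than precomputed as in step 1 of the algorithm:

```python
import random

# toy undirected graph as an adjacency list (assumption for illustration)
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}

def biased_step(prev, cur, p=1.0, q=1.0):
    """Pick the next node given (prev, cur): weight 1/p for returning to prev,
    1 for nodes at distance 1 from prev (BFS-like), 1/q for moving farther (DFS-like)."""
    neighbors = graph[cur]
    weights = []
    for nxt in neighbors:
        if nxt == prev:                 # return to the previous node
            weights.append(1.0 / p)
        elif nxt in graph[prev]:        # stays at distance 1 from prev
            weights.append(1.0)
        else:                           # moves outward, distance 2 from prev
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(start, length=5, p=1.0, q=1.0):
    walk = [start, random.choice(graph[start])]
    while len(walk) < length:
        walk.append(biased_step(walk[-2], walk[-1], p, q))
    return walk

print(node2vec_walk(0, length=6, p=0.5, q=2.0))
```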
Translating Embeddings for Modeling Multi-relational Data
- TransE represents facts as triples \((h, l, t)\), where \(h\) is the head entity, \(l\) is the relation, and \(t\) is the tail entity
- If the given fact is true, \(h + l \approx t\); if it is false, \(h + l \neq t\) in the embedding space \(\mathbb{R}^k\)
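A minimal sketch of the TransE scoring idea: after training, a true triple \((h, l, t)\) should give a small distance \(\lVert h + l - t \rVert\); the entity/relation names and the random (untrained) embeddings below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5   # embedding dimension (assumption)

# placeholder entity and relation embeddings (illustrative only, not trained)
entity = {"Paris": rng.normal(size=k), "France": rng.normal(size=k)}
relation = {"capital_of": rng.normal(size=k)}

def transe_score(h, l, t):
    """Distance ||h + l - t||; small for true triples once embeddings are trained."""
    return float(np.linalg.norm(entity[h] + relation[l] - entity[t]))

print(transe_score("Paris", "capital_of", "France"))
```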
Embedding Entire Graphs
- Run a standard node embedding technique on the (sub)graph and sum (or average) the node embeddings (see the sketch at the end of this section)
- \[Z_G = \sum_{ {v} \in {G}}z_v\]
- Introduce a “virtual node” to represent the subgraph and run a standard graph embedding technique
- Use anonymous walk embeddings
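A minimal sketch of the first (simplest) graph-level embedding: sum or average the node embeddings of a (sub)graph; the placeholder embedding matrix `Z` and the node set are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))      # placeholder node embeddings: 6 nodes, d = 4

def graph_embedding(nodes, average=False):
    """Z_G = sum of z_v over v in the (sub)graph, optionally averaged."""
    z_g = Z[list(nodes)].sum(axis=0)
    return z_g / len(nodes) if average else z_g

print(graph_embedding({0, 2, 3}))
```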