CS224W Machine Learning with Graphs
Lecture 7
Title: Graph Representation Learning
Date: 2020. 04. 14 (TUE) ~ 2020. 04. 16 (THU)
Materials: Slides, YouTube
Additional Materials: Negative Sampling (처음의 마음), Graph Embedding
Introduction
- Machine Learning in Networks
- Node classification
- Link prediction
- Machine Learning Lifecycle
- Raw Data -> Structured Data -> Learning Algorithm -> Model
- “Feature Engineering” is the process between Raw Data and Structured Data
- Task: how to learn the features automatically
- Feature Learning in Graphs
- Embedding: a learned feature representation for the given nodes
- Map each node in a network into a “low-dimensional space”
Embedding Nodes
- Encode nodes into an embedding space with an encoder \(ENC(u)\) so that similarity in the embedding space reflects similarity in the network
- Goal: \(similarity(u, v) \approx {z_u}^\intercal{z_v}\)
- Task
- Define an encoder
- Define a node similarity function
- Optimize the parameters of the encoder
Encoder
- \(ENC(v) = z_v\) (\(d\)-dimensional embedding)
- Shallow Encoding
- \[ENC(v) = Zv\]
- \(Z \in \mathbb{R}^{d\times|\mathcal{V}|}\), \(v \in \mathbb{I}^{|\mathcal{V}|}\) (one-hot indicator vector of the node)
- \(Z\) is the embedding matrix, the main object we want to learn (see the sketch below)
- Available methods: DeepWalk, node2vec, TransE
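A minimal sketch of shallow encoding as an embedding-matrix lookup, together with the dot-product similarity used in the next section; the toy sizes (`num_nodes`, `d`) and the random initialization are illustrative assumptions, not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes = 5   # |V| of a toy graph (assumption)
d = 3           # embedding dimension (assumption)

# Z is the d x |V| embedding matrix we want to learn.
Z = rng.normal(size=(d, num_nodes))

def encode(v):
    """Shallow encoder: ENC(v) = Z v, a column lookup
    where v is the one-hot indicator vector of the node."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot          # equivalent to Z[:, v]

# Similarity is the dot product between node embeddings.
z_u, z_v = encode(0), encode(1)
print(float(z_u @ z_v))
```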
Similarity function
- \(similarity(u, v) \approx {z_v}^\intercal{z_u}\) (dot product between node embeddings)
Random Walk Approaches to Node Embeddings
- A random walk on the graph is a random sequence of nodes
- typically about 5 steps long (see the walk sketch at the end of this section)
- \({z_u}^\intercal{z_v} \approx\) the probability that \(u\) and \(v\) co-occur on a random walk over the graph
- \(P_R(v|u)\): the probability of reaching \(v\) on a random walk from \(u\) using strategy \(R\)
- Similarity (\(cos(\theta)\)) \(\varpropto P_R(v|u)\)
- Why use random walks?
- Expressivity
- Efficiency: only need to consider pairs that co-occur on random walks
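A minimal sketch of an unbiased fixed-length random walk over an adjacency-list graph; the toy `graph` and the walk length of 5 are illustrative assumptions:

```python
import random

# toy undirected graph as an adjacency list (assumption for illustration)
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}

def random_walk(graph, start, length=5):
    """Take 'length' steps, choosing the next node uniformly among neighbors."""
    walk = [start]
    for _ in range(length):
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

print(random_walk(graph, start=0))   # e.g. [0, 2, 1, 3, 4, 3]
```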
Unsupervised Feature Learning (DeepWalk)
- \(N_R(u)\): the neighborhood of \(u\) obtained by some strategy \(R\)
- i.e., the nodes that appear on the same walks generated by that strategy
- Optimization
- Log-likelihood objective: \(\max_z \sum_{ {u} \in {V}} \log P(N_R(u) | z_u)\)
- Random Walk Optimization
- Run short fixed-length random walks starting from each node
- For each node \(u\) collect \(N_R(u)\)
- Optimize embeddings
- Loss function (\(\mathcal{L}\)) \(= \sum_{ {u}\in{V}}\sum_{ {v}\in{N_R(u)}} - \log(P(v | z_u))\)
- Parameterize \(P(v | z_u)\) using softmax
- \[P(v | z_u) = { {exp(z_u^\intercal z_v)} \over {\sum_{ {n}\in{V}}exp(z_u^\intercal z_n)}}\]
- Optimizing means finding embeddings \(z_u\) that minimize \(\mathcal{L}\) (using Stochastic Gradient Descent)
- Negative Sampling
- The concept is used in Word2Vec
- Instead of normalizing over all nodes, normalize against only a few randomly sampled nodes (in Word2Vec, the observed pair plus a handful of unrelated negative samples); see the sketch below
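A minimal sketch of a Word2Vec-style negative-sampling loss for a single co-occurring pair \((u, v)\); the sigmoid form, the number of negatives `k`, and drawing negatives uniformly are simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 10, 4
Z = rng.normal(size=(num_nodes, d))   # one embedding row per node (placeholder)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u, v, k=3):
    """-log sigma(z_u . z_v) - sum over k sampled negatives of log sigma(-z_u . z_n);
    approximates the full softmax normalization with only k nodes."""
    loss = -np.log(sigmoid(Z[u] @ Z[v]))
    negatives = rng.choice(num_nodes, size=k, replace=False)
    for n in negatives:
        loss -= np.log(sigmoid(-(Z[u] @ Z[n])))
    return float(loss)

print(neg_sampling_loss(u=0, v=1))
```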
Node2Vec
- Capture local (BFS, microscopic view) and global (DFS, macroscopic view) views of the network
- Interpolate between BFS and DFS
- \(p\): the return parameter; controls the probability of walking back to the previous node
- \(q\): the in-out parameter; the ratio of BFS-like exploration (staying close to the previous node) vs. DFS-like exploration (moving outward); see the biased-walk sketch after this section
- Node2Vec algorithm
- Compute random walk probabilities
- Simulate \(r\) random walks of length \(l\) starting from each node \(u\)
- Optimize the node2vec objective using Stochastic Gradient Descent
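A minimal sketch of node2vec's second-order biased step with return parameter \(p\) and in-out parameter \(q\); the toy `graph` is an assumption, and the unnormalized weights are computed on the fly here rather than precomputed as in step 1 of the algorithm:

```python
import random

# toy undirected graph as an adjacency list (assumption for illustration)
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}

def biased_step(prev, cur, p=1.0, q=1.0):
    """Pick the next node given (prev, cur): weight 1/p for returning to prev,
    1 for nodes at distance 1 from prev (BFS-like), 1/q for moving farther (DFS-like)."""
    neighbors = graph[cur]
    weights = []
    for nxt in neighbors:
        if nxt == prev:                 # return to the previous node
            weights.append(1.0 / p)
        elif nxt in graph[prev]:        # stays at distance 1 from prev
            weights.append(1.0)
        else:                           # moves outward, distance 2 from prev
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(start, length=5, p=1.0, q=1.0):
    walk = [start, random.choice(graph[start])]
    while len(walk) < length:
        walk.append(biased_step(walk[-2], walk[-1], p, q))
    return walk

print(node2vec_walk(0, length=6, p=0.5, q=2.0))
```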
Translating Embeddings for Modeling Multi-relational Data
- TransE represents facts as triples \((h, l, t)\), where \(h\) is the head entity, \(l\) is the relation, and \(t\) is the tail entity
- If the given fact is true, \(h + l \approx t\); if it is false, \(h + l \neq t\) in the embedding space \(\mathbb{R}^k\)
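A minimal sketch of the TransE scoring idea: after training, a true triple \((h, l, t)\) should give a small distance \(\lVert h + l - t \rVert\); the entity/relation names and the random (untrained) embeddings below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5   # embedding dimension (assumption)

# placeholder entity and relation embeddings (illustrative only, not trained)
entity = {"Paris": rng.normal(size=k), "France": rng.normal(size=k)}
relation = {"capital_of": rng.normal(size=k)}

def transe_score(h, l, t):
    """Distance ||h + l - t||; small for true triples once embeddings are trained."""
    return float(np.linalg.norm(entity[h] + relation[l] - entity[t]))

print(transe_score("Paris", "capital_of", "France"))
```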
Embedding Entire Graphs
- Run a standard node embedding technique on the (sub)graph and sum (or average) the node embeddings (see the sketch at the end of this section)
- \[Z_G = \sum_{ {v} \in {G}}z_v\]
- Introduce a “virtual node” to represent the subgraph and run a standard graph embedding technique
- Use anonymous walk embeddings
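A minimal sketch of the first (simplest) graph-level embedding: sum or average the node embeddings of a (sub)graph; the placeholder embedding matrix `Z` and the node set are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))      # placeholder node embeddings: 6 nodes, d = 4

def graph_embedding(nodes, average=False):
    """Z_G = sum of z_v over v in the (sub)graph, optionally averaged."""
    z_g = Z[list(nodes)].sum(axis=0)
    return z_g / len(nodes) if average else z_g

print(graph_embedding({0, 2, 3}))
```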