Transformers are Graph Neural Networks

1 Transformer

1.1 Representation Learning for NLP

  1. NNs build representations of input data as vectors/embeddings, which encode useful statistical and semantic information about the data.
  2. For NLP:
    1. Recurrent Neural Networks (RNNs) build representations sequentially, i.e., one word at a time.
    2. Transformers use an attention mechanism to figure out how important every other word in the sentence is w.r.t. the word being updated.
      • The updated word feature is a weighted sum of linear transformations of the features of all the words (a minimal contrast of the two update styles is sketched below).
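A minimal PyTorch sketch contrasting the two update styles; the tensor sizes and the use of raw dot products as stand-in attention scores are illustrative assumptions, not the full models.

```python
import torch

n, d = 5, 8                       # 5 words with 8-dimensional features (illustrative)
words = torch.randn(n, d)         # stand-in word embeddings

# RNN-style: build the representation one word at a time, carrying a hidden state.
W_h, W_x = torch.randn(d, d), torch.randn(d, d)
h = torch.zeros(d)
for x in words:                   # strictly sequential: word t waits for word t-1
    h = torch.tanh(W_h @ h + W_x @ x)

# Transformer-style: update every word in parallel as a weighted sum of
# linear transformations of the features of all the words.
V = torch.randn(d, d)
weights = torch.softmax(words @ words.T, dim=-1)   # how important is word j to word i
updated = weights @ (words @ V.T)                  # all n words updated at once
```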

1.2 Breaking down the Transformer

Figure: The Transformer architecture.

  1. The hidden feature $h_i^{\ell}$ of the $i$-th word in a sentence is updated from layer $\ell$ to layer $\ell+1$ as:
     $$h_i^{\ell+1} = \sum_{j \in \mathcal{S}} w_{ij} \left( V^{\ell} h_j^{\ell} \right), \qquad w_{ij} = \text{softmax}_j \left( Q^{\ell} h_i^{\ell} \cdot K^{\ell} h_j^{\ell} \right)$$
    • $j \in \mathcal{S}$ is the set of words in the sentence; $Q^{\ell}, K^{\ell}, V^{\ell}$ are learnable linear weights (denoting the Query, Key and Value for the attention computation, respectively).
    • Figure: the attention mechanism pipeline.
  2. Multi-head attention mechanism: run $K$ attention heads in parallel and concatenate their outputs:
     $$h_i^{\ell+1} = \text{Concat}\left( \text{head}_1, \ldots, \text{head}_K \right) O^{\ell}, \qquad \text{head}_k = \text{Attention}\left( Q^{k,\ell} h_i^{\ell},\, K^{k,\ell} h_j^{\ell},\, V^{k,\ell} h_j^{\ell} \right)$$
    • $O^{\ell}$ is a down-projection to match the dimensions of $h_i^{\ell+1}$ and $h_i^{\ell}$ across layers.
    • Motivation: bad random initializations for the dot-product attention mechanism can de-stabilize the learning process (a minimal PyTorch sketch of multi-head attention follows this list).
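A minimal PyTorch sketch of the multi-head attention update above; fusing the per-head weights into single linear layers and omitting masking/dropout are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        # Q^l, K^l, V^l: learnable linear weights (one set per head, fused here)
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # O^l: down-projection back to d_model after concatenating the heads
        self.o = nn.Linear(d_model, d_model)

    def forward(self, h):                      # h: (n_words, d_model)
        n, _ = h.shape
        # split into heads: (n_heads, n_words, d_head)
        q = self.q(h).view(n, self.n_heads, self.d_head).transpose(0, 1)
        k = self.k(h).view(n, self.n_heads, self.d_head).transpose(0, 1)
        v = self.v(h).view(n, self.n_heads, self.d_head).transpose(0, 1)
        # w_ij = softmax_j(Q h_i . K h_j / sqrt(d))
        w = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (w @ v).transpose(0, 1).reshape(n, -1)   # concatenate the heads
        return self.o(out)                             # apply O^l

h = torch.randn(6, 32)                      # 6 words, d_model = 32
print(MultiHeadAttention(32, 4)(h).shape)   # torch.Size([6, 32])
```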

1.3 Scale Issues and the Feed-forward Sub-layer

  1. Motivation: the features for words after the attention mechanism might be at different scales or magnitudes.
    1. Issue 1: some words may have very sharp or very distributed attention weights.
      • Scaling the dot-product attention by the square root of the feature dimension: $w_{ij} = \text{softmax}_j \left( \frac{Q^{\ell} h_i^{\ell} \cdot K^{\ell} h_j^{\ell}}{\sqrt{d}} \right)$.
    2. Issue 2: each attention head may output values at different scales.
      • LayerNorm: normalizes and learns an affine transformation at the feature level.
  2. Another 'trick' to control the scale issue: a position-wise 2-layer MLP applied to each word's features, wrapped in LayerNorm:
     $$h_i^{\ell+1} = \text{LN}\left( \text{MLP}\left( \text{LN}\left( h_i^{\ell+1} \right) \right) \right)$$
  3. Stacking layers: residual connections between the inputs and outputs of each multi-head attention sub-layer and the feed-forward sub-layer (a minimal Transformer block combining these pieces is sketched below).
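A minimal PyTorch sketch wiring these pieces into one Transformer block (self-attention, residual connections, LayerNorm, and the position-wise MLP), roughly following the post-LN layout implied by the formula above; `nn.MultiheadAttention` stands in for the attention sub-layer and the hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)   # normalize + learned affine transform
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(            # position-wise 2-layer MLP
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model),
        )

    def forward(self, h):                    # h: (batch, n_words, d_model)
        attn_out, _ = self.attn(h, h, h)     # self-attention over the sentence
        h = self.norm1(h + attn_out)         # residual connection + LayerNorm
        h = self.norm2(h + self.mlp(h))      # residual connection + LayerNorm
        return h

h = torch.randn(2, 6, 32)                    # 2 sentences, 6 words, d_model = 32
print(TransformerBlock(32, 4, 64)(h).shape)  # torch.Size([2, 6, 32])
```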

2 GNNs

  1. Neighborhood aggregation (or [[message passing]])
    • Each node $i$ gathers features from its neighbors $j \in \mathcal{N}(i)$ to update its representation of the local graph structure around it (a minimal message-passing layer is sketched after this list):
      $$h_i^{\ell+1} = \sigma\left( U^{\ell} h_i^{\ell} + \sum_{j \in \mathcal{N}(i)} V^{\ell} h_j^{\ell} \right)$$
      1. $U^{\ell}, V^{\ell}$ are learnable weight matrices of the GNN layer.
      2. $\sigma$ is a non-linearity such as ReLU.
      3. The summation over the neighborhood nodes $j \in \mathcal{N}(i)$ can be replaced by other input size-invariant aggregation functions:
        • Mean
        • Max
        • Weighted sum via an attention mechanism
  2. Stacking several GNN layers enables the model to propagate each node's features over the entire graph.
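A minimal PyTorch sketch of this neighborhood-aggregation update with sum aggregation; the dense adjacency matrix is an illustrative simplification (GNN libraries use sparse message passing instead).

```python
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.U = nn.Linear(d, d, bias=False)   # U^l: transform the node itself
        self.V = nn.Linear(d, d, bias=False)   # V^l: transform the neighbours

    def forward(self, h, adj):
        # h: (n_nodes, d); adj: (n_nodes, n_nodes) with adj[i, j] = 1 if j in N(i)
        neighbour_sum = adj @ self.V(h)              # sum_{j in N(i)} V^l h_j
        return torch.relu(self.U(h) + neighbour_sum) # sigma = ReLU

# toy graph: 4 nodes connected in a ring
adj = torch.tensor([[0., 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [1, 0, 1, 0]])
h = torch.randn(4, 8)
print(GNNLayer(8)(h, adj).shape)                     # torch.Size([4, 8])
```

Replacing the sum with a mean corresponds to row-normalizing `adj`; replacing it with an attention-weighted sum gives the attention-based aggregation mentioned above.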

3 Transformers Are GNNs


3.1 Sentences Are Fully-connected Word Graphs

  1. Consider a sentence as a fully-connected graph, where each word is connected to every other word.

  2. Transformers are GNNs with multi-head attention as the neighborhood aggregation function.

    1. Transformers: aggregate over the entire fully-connected graph of words.
    2. GNNs: aggregate over the local neighborhood of each node (the dense-adjacency sketch below makes the correspondence concrete).
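A minimal sketch of the correspondence: the attention weights $w_{ij}$ form a dense, row-normalized 'adjacency matrix' over the fully-connected word graph, so one attention update is a weighted-sum aggregation over every word. The sizes and random weights are illustrative assumptions.

```python
import torch

n, d = 6, 16
h = torch.randn(n, d)                           # word features
Q, K, V = (torch.randn(d, d) for _ in range(3)) # stand-in Query/Key/Value weights

w = torch.softmax((h @ Q) @ (h @ K).T / d ** 0.5, dim=-1)  # (n, n) soft adjacency
h_new = w @ (h @ V)                             # aggregate over every 'neighbour'

print(w.shape, torch.allclose(w.sum(dim=-1), torch.ones(n)))  # each row sums to 1
```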

3.2 What Can We Learn from Each Other?

3.2.1 Are Fully-connected Graphs the Best Input Format for NLP?

  • Linguistic structure
    1. Syntax trees/graphs.
    2. Tree LSTMs.

3.2.2 How to Learn Long-term Dependencies?

  1. Fully-connected graphs make learning very long-term dependencies between words difficult.
    • In an $n$-word sentence, a Transformer/GNN would be doing computations over $n^2$ pairs of words.
  2. Making the attention mechanism sparse or adaptive in terms of input size (a toy local-window attention mask is sketched after this list), e.g.:
    1. Adding recurrence or compression into each layer.
    2. Locality Sensitive Hashing.
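A minimal sketch of one way to make attention sparse: a fixed local-window mask. The window size and masking scheme are illustrative assumptions, not the specific sparse/adaptive or LSH-based methods referenced above.

```python
import torch

n, d, window = 8, 16, 2
h = torch.randn(n, d)
Q, K, V = (torch.randn(d, d) for _ in range(3))

scores = (h @ Q) @ (h @ K).T / d ** 0.5                 # (n, n) dense attention scores
idx = torch.arange(n)
mask = (idx[:, None] - idx[None, :]).abs() <= window    # True only near the diagonal
scores = scores.masked_fill(~mask, float('-inf'))       # drop long-range pairs

w = torch.softmax(scores, dim=-1)                       # each word attends to O(window) words
h_new = w @ (h @ V)
print((w > 0).sum(dim=-1))                              # at most 2*window + 1 nonzeros per row
```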

3.2.3 Are Transformers Learning 'Neural Syntax'?

  1. Using attention to identify which pairs of words are the most interesting enables Transformers to learn something like a task-specific syntax.
  2. Different attention heads can be considered as capturing different syntactic properties.

3.2.4 Why Multiple Heads of Attention? Why Attention?

  1. The optimization view: having multiple attention heads improves learning and overcomes bad random initializations.
  2. GNNs with simpler aggregation functions such as sum or max do not require multiple aggregation heads for stable training.
  3. ConvNet architectures.

3.2.5 Why is Training Transformers so Hard?

  1. Hyper-parameters: learning rate schedule, warmup strategy and decay settings (a minimal warmup-then-decay schedule is sketched below).
  2. The specific permutation of normalization and residual connections within the architecture.
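A minimal sketch of the warmup-then-decay learning-rate schedule from the original Transformer paper (the 'Noam' schedule); the model size and warmup length below are illustrative defaults.

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Learning rate at a given step: linear warmup, then ~1/sqrt(step) decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for s in (1, 100, 4000, 40000):
    print(s, round(transformer_lr(s), 6))   # rises linearly, peaks at `warmup`, then decays
```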
Author: Haowei
Link: http://howiehsu0126.github.io/2023/10/18/Transformers are Graph Neural Networks/
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit Haowei Hub when reposting.