Decoder

Cross-attention is used in encoder-decoder transformers (e.g., machine translation) and in multi-modal models — any setting where the attention matrices come from different sequences:
- Q (Query) matrix comes from the target/output sequence (e.g., the decoder's own representations)
- K (Key) and V (Value) matrices come from the source/input sequence (e.g., the encoder's output representations)
This arrangement allows each position in the output sequence to query and extract relevant information from the entire input sequence. The attention weights, computed as softmax(Q×K^T / √d_k), determine how much each position in the output should focus on each position in the input.
It's exactly this separation of Q from K and V that defines cross-attention and distinguishes it from self-attention (where Q, K, and V all come from the same sequence).
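The separation described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full implementation: the projection matrices `Wq`, `Wk`, `Wv` and the dimensions are made up for the example, and real models add multiple heads, masking, and output projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Q comes from the decoder; K and V come from the encoder."""
    Q = decoder_states @ Wq          # (T_out, d_k)
    K = encoder_states @ Wk          # (T_in,  d_k)
    V = encoder_states @ Wv          # (T_in,  d_k)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T_out, T_in)
    return weights @ V, weights

# Toy shapes: 3 decoder positions attend over 5 encoder positions.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
dec = rng.normal(size=(3, d_model))
enc = rng.normal(size=(5, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, w = cross_attention(dec, enc, Wq, Wk, Wv)
# w has shape (3, 5): one row of attention weights per output position,
# and each row sums to 1 over the input positions.
```

Setting `encoder_states = decoder_states` here would recover self-attention, which makes the Q-vs-K/V separation concrete.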
Self-Supervised Pretraining Paradigm

BERT
BERT is fundamentally an encoder-only transformer architecture.
- Masked Language Modeling (MLM):
  - Randomly mask 15% of the tokens in each sequence
  - The model must predict these masked tokens based on the surrounding context
  - This forces a bidirectional understanding of context
- Next Sentence Prediction (NSP):
  - Given two sentences, predict whether the second follows the first in the original text
  - This helps the model understand relationships between sentences
  - Input format: [CLS] Sentence A [SEP] Sentence B [SEP]
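The NSP input format above can be sketched as a small helper. This is a simplified illustration: it splits on whitespace, whereas real BERT uses a WordPiece tokenizer, and the function name is invented for the example.

```python
def build_nsp_input(sentence_a, sentence_b):
    """Assemble the [CLS] A [SEP] B [SEP] format used for NSP."""
    # naive whitespace tokens; real BERT uses a WordPiece tokenizer
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    # segment ids: 0 for [CLS] + sentence A + first [SEP], 1 for the rest
    first_sep = tokens.index("[SEP]")
    segment_ids = [0] * (first_sep + 1) + [1] * (len(tokens) - first_sep - 1)
    return tokens, segment_ids

tokens, segs = build_nsp_input("the cat sat", "it purred")
# tokens: ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
# segs:   [0, 0, 0, 0, 0, 1, 1, 1]
```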

MLM is essentially a fill-in-the-blank task over the training corpus.
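The masking step can be sketched as follows. This is a simplified version of BERT's procedure: it always substitutes [MASK], while the actual recipe replaces a chosen token with [MASK] only 80% of the time (10% a random token, 10% unchanged); the function name and seed are invented for the example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace ~15% of tokens with [MASK]; the model must predict the originals."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)   # no MLM loss at unmasked positions
    return masked, labels

corpus = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] * 10
masked, labels = mask_tokens(corpus)
```

The loss is computed only at the masked positions, which is why `labels` is `None` everywhere else.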
Special Tokens
- [CLS] (Classification): the first token of every sequence; its final hidden state serves as the aggregate sequence representation for classification tasks