Lecture Slides

Lecture Notes

Decoder


Cross-attention is used when the attention matrices come from different sequences: Q is derived from one sequence (e.g., the decoder's output so far) while K and V are derived from another (e.g., the encoder's input representation), as in encoder-decoder models and multi-modal tasks.

This arrangement allows each position in the output sequence to query and extract relevant information from the entire input sequence. The attention weights computed from QK^T determine how much each position in the output should focus on each position in the input.

It's exactly this separation of Q from K and V that defines cross-attention and distinguishes it from self-attention (where Q, K, and V all come from the same sequence).
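To make the Q-versus-K/V separation concrete, here is a minimal NumPy sketch of single-head cross-attention. The variable names (`x_out`, `x_in`, the projection matrices) and the random dimensions are illustrative assumptions, not anything specified in the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_out, x_in, Wq, Wk, Wv):
    # Q comes from the output (query) sequence;
    # K and V come from a *different* input sequence.
    Q = x_out @ Wq
    K = x_in @ Wk
    V = x_in @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (len_out, len_in)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
x_out = rng.normal(size=(3, d))   # e.g. 3 decoder positions
x_in = rng.normal(size=(5, d))    # e.g. 5 encoder positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, w = cross_attention(x_out, x_in, Wq, Wk, Wv)
print(out.shape)  # (3, 8): one attended vector per output position
```

Note that `weights` has shape `(len_out, len_in)`: every output position distributes its attention over all input positions, exactly as the text above describes. Self-attention is the special case where `x_out` and `x_in` are the same sequence.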

Self-Supervised Pretraining Paradigm


BERT

BERT (Bidirectional Encoder Representations from Transformers) is fundamentally an encoder-only transformer architecture.


BERT's masked language modeling objective is a lot like fill-in-the-blank: a subset of input tokens is hidden, and the model must predict the originals from the surrounding (bidirectional) context.
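The fill-in-the-blank setup can be sketched as follows. This is a simplified illustration with a made-up sentence; real BERT pretraining masks about 15% of tokens and replaces a masked token with `[MASK]` only 80% of the time (10% a random token, 10% unchanged), which this sketch omits.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].
    The pretraining objective is to predict the original token
    at each masked position; unmasked positions get no loss."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)      # target the model must recover
        else:
            masked.append(tok)
            labels.append(None)     # position excluded from the loss
    return masked, labels

# Hypothetical tokenized sentence for illustration.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
```

Because the labels come from the text itself, no human annotation is needed; this is what makes the objective self-supervised.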

Special Tokens