learn LLMs from scratch speedrun
10:19 PM 17-03-2025
task :
- be able to explain how LLMs work at the core level
i'll start from where it all began. dall-e 2. found it on twitter. amazing. that's how i found openai. in fact, my profile picture to this day is the image i generated in dall-e 2. the output wasn't as good as their examples but it was still something i'd never seen in my life. wait. forget all this.
let's start with the transformers paper. i've spent some time on it in the past but it wasn't fruitful. i abandoned it midway. i wanna do it again. i guess i'm delusionally stuck in my ways. i'm still hopeful.
10:48 PM 17-03-2025
basically, they had RNNs but those process tokens strictly in sequence => can't parallelize much; so they came up with the transformer architecture, built on something called the attention mechanism. with attention, computation across the whole sequence can be parallelized => faster everything
11:14 PM 17-03-2025
self attention and multi-headed attention ?
in this 'encoder-decoder' system, a sequence of input tokens goes in and a sequence of output tokens comes out, one token at a time. the tokens that have already been generated are fed back in and considered while generating the next output token. maybe this is why chatgpt/deepseek sometimes starts answering the question and then, a few tokens later, decides that the response is banned.
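to convince myself about the token-by-token loop, here's a minimal sketch of greedy autoregressive decoding. `model`, the token ids, and `eos_id` are all hypothetical stand-ins, not any real API:

# minimal sketch of autoregressive (token-by-token) generation.
# `model` is a hypothetical callable: given the tokens so far, it returns
# a list of scores over the vocabulary for the NEXT token.

def generate(model, prompt_ids, max_new_tokens=20, eos_id=0):
    tokens = list(prompt_ids)                # start from the input sequence
    for _ in range(max_new_tokens):
        logits = model(tokens)               # attends over ALL tokens so far,
                                             # including ones it generated itself
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        tokens.append(next_id)
        if next_id == eos_id:                # stop at end-of-sequence
            break
    return tokens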
i don't understand the image. the arrow is inwards in case of both input and output embeddings. and there's this extra part in output layer. i don't even know what these arrows are saying.
// from transformers pdf
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
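the masking part only clicked when i thought of it as code. a rough numpy sketch (my own illustration, not the paper's code) of how the causal mask gets applied to the attention scores before softmax:

import numpy as np

# scores[i, j] = how much position i wants to attend to position j
# (i.e. QK^T / sqrt(d_k) for the decoder's self-attention).
# setting entries with j > i to -inf before softmax means position i
# can only look at positions <= i, the "known outputs" so far.

def causal_softmax(scores):
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(mask, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)           # rows sum to 1, future gets 0 weight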
ok. so in each of the 6 encoder layers, there are 2 sub-layers: one is multi-head self-attention and the other is a feed-forward network. around each sub-layer there's a residual connection, followed by layer normalization. the output sequence of this encoder is given to the decoder part.
in the decoder part, there's an extra sub-layer - total 3. this extra sub-layer does multi-head attention over the encoder's output, and the decoder's own self-attention sub-layer is modified with a mask so it can't look at future positions.
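to make the LayerNorm(x + Sublayer(x)) line from the paper concrete, a rough numpy sketch of the "add & norm" wrapper (learned scale/shift left out, names are my own):

import numpy as np

# each sub-layer (multi-head attention or the feed-forward network) is wrapped as
# output = LayerNorm(x + sublayer(x)): a residual connection followed by layer norm.

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)        # learned gain/bias omitted for brevity

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))

# an encoder layer is then just:
#   x = add_and_norm(x, self_attention)
#   x = add_and_norm(x, feed_forward)
# and a decoder layer adds a third add_and_norm around attention over the encoder output.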
wait. i don't get it. are the first two layers doing the same job in both encoder and decoder? if yes, why unnecessary duplication?
[question]
"attention" ; what is it? so the attention function is just mapping of query - (key value pair) - output
all these(q, k, v, o) are vectors. output is weighted sum of values. i think these are the "weights" that everybody talks about while tlaking about LLMs;
scaled dot-product attention; multi-headed attention:
// grok 3
what is scaled dot product attention and multi headed attention in the context of LLMs, transformers architecture?
Let’s break this down step-by-step to explain scaled dot-product attention and multi-headed attention in the context of Large Language Models (LLMs) and the Transformer architecture.
Scaled Dot-Product Attention
Scaled dot-product attention is a core mechanism in the Transformer architecture, introduced in the paper "Attention is All You Need" (Vaswani et al., 2017). It’s a way to compute how much focus (or "attention") one token (word, subword, etc.) in a sequence should pay to another token, based on their relationship. Here’s how it works:
Inputs: The mechanism takes three vectors for each token in the sequence:
Query (Q): Represents the token asking "who should I pay attention to?"
Key (K): Represents the token being compared against.
Value (V): Contains the actual information to be passed along if attention is assigned.
These Q, K, and V vectors are derived from the input embeddings by multiplying them with learned weight matrices.
Dot Product: The similarity between the Query and Key vectors is computed using a dot product: Q · K^T. This produces a score indicating how relevant each Key is to the Query.
Scaling: The dot product is scaled by dividing it by the square root of the dimension of the Key vectors, sqrt(d_k). This step prevents the scores from growing too large as the dimensionality increases, which could destabilize training due to large gradients.
Formula: scaled score = (Q · K^T) / sqrt(d_k).
Softmax: The scaled scores are passed through a softmax function to convert them into probabilities (weights) that sum to 1. This determines how much attention each Value gets.
Weighted Sum: The final output is a weighted sum of the Value vectors, where the weights come from the softmax output: Attention(Q, K, V) = softmax((Q · K^T) / sqrt(d_k)) · V.
This mechanism allows the model to dynamically focus on the most relevant parts of the input sequence for each token, making it highly effective for tasks like translation or text generation.
Multi-Headed Attention
Multi-headed attention builds on scaled dot-product attention by running it multiple times in parallel, allowing the model to capture different types of relationships between tokens. Here’s the breakdown:
Multiple Heads: Instead of performing attention once, the input Q, K, and V vectors are split into multiple smaller sets (or "heads"). Each head computes its own scaled dot-product attention independently.
For example, if the original dimension is 512 and there are 8 heads, each head works with vectors of dimension 64 (512 ÷ 8).
Parallel Computation: Each head has its own set of learned weight matrices for Q, K, and V, so they can focus on different aspects of the input. One head might capture syntactic relationships (e.g., subject-verb agreement), while another captures semantic relationships (e.g., topical similarity).
Concatenation: After computing attention for each head, the outputs are concatenated back together into a single vector.
Linear Transformation: The concatenated output is passed through a final linear layer (with another learned weight matrix) to produce the final multi-head attention output, which matches the original input dimension.
Formula: MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) · W^O,
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), W_i^Q, W_i^K, W_i^V are the weight matrices for the i-th head, and W^O is the output projection matrix.
Why Use These in Transformers/LLMs?
Scaled Dot-Product Attention: Efficiently captures dependencies between tokens regardless of their distance in the sequence, unlike RNNs or LSTMs, which struggle with long-range dependencies. The scaling ensures numerical stability.
Multi-Headed Attention: Enhances the model’s capacity by allowing it to attend to multiple types of relationships simultaneously. This makes Transformers highly expressive and adaptable to complex language patterns.
In LLMs like BERT, GPT, or others based on Transformers, these mechanisms are stacked in multiple layers, enabling the model to build rich, hierarchical representations of text. For example, in the encoder-decoder structure of the original Transformer, multi-head attention is used both within the input sequence (self-attention) and between the input and output sequences (cross-attention).
medium link - scaled dot product attention
01:29 AM 18-03-2025
i'm really slow. anyways, in case of self attention - there is a series of operations - mul(q,k) -> scale by 1/root(dk) -> softmax -> softmax weights * v
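that chain as a quick numpy sketch (my own, assuming q, k, v are already computed from the embeddings; shapes [seq_len, d_k] for q/k and [seq_len, d_v] for v):

import numpy as np

# scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the values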
for multi-head attention, the model is looking at word-relations from several different angles. the q, k, v vectors are split into per-head pieces, the attention operations are performed per head, and then the head outputs are concatenated back into a single vector as expected.
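and a rough multi-head wrapper on top of the scaled_dot_product_attention sketch above (random matrices stand in for the learned per-head projections and W^O, just to make the shapes concrete):

import numpy as np

# multi-head attention: project x into per-head Q, K, V, run scaled dot-product
# attention in each head, concatenate the head outputs, apply the output projection.

def multi_head_attention(x, num_heads=8):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads                  # e.g. 512 // 8 = 64
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(num_heads):
        W_q = rng.normal(size=(d_model, d_head))
        W_k = rng.normal(size=(d_model, d_head))
        W_v = rng.normal(size=(d_model, d_head))
        heads.append(scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v))
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o    # back to [seq_len, d_model]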
i still don't get the full vibe of it. i'm going to read https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs
12:53 AM 21-03-2025
yep. accidental restart just deleted my unpublished blog entry. cool. less complaining now.
anyways, seems like a good time to take a look back at everything and learn ;
08:41 PM 21-03-2025
vague understanding of LLMs
there are basically 2 parts in a transformer architecture - encoder and decoder. you take the input and embed it into vectors. and then there are these 3 things - q, k, v - query, key, value
it's kind of like the browser search query retrieval analogy. "query" represents the token asking - "which tokens should i focus on?"; keys are like dictionary keys - they get matched against the query to find the 'right'/appropriate tokens; values are the actual information those keys point to, retrieved in proportion to how well the keys match the query. and in the encoder-decoder ('cross') attention, the queries come from the decoder while the keys and values come from the encoder, so the decoder always has the complete context of the input sentence.
once the input text is embedded into vectors, learned weight matrices Wq, Wk, Wv project each embedding into its q, k, v vectors - these are the weights that get trained to produce the optimal output. i don't know how neural networks work. so i need to train a neural network once at least; anyways, the dot product of q and k gives a similarity score, you scale that score by 1/root(dk) where dk is the dimension of the key vectors, apply softmax, and use the resulting weights to take a weighted sum of the v vectors. that whole chain is scaled dot product attention. this whole paragraph stuff is part of self attention.
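to pin down where the scaling actually happens, a small end-to-end numpy sketch with made-up shapes and random matrices standing in for the learned Wq, Wk, Wv: the embeddings get projected into q, k, v, and the 1/root(dk) scaling is applied to the q·k scores, not to the embeddings themselves:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 512, 64

X = rng.normal(size=(seq_len, d_model))         # pretend these are the token embeddings
Wq = rng.normal(size=(d_model, d_k))            # learned projection matrices (random here)
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                # q, k, v vectors for every token
scores = Q @ K.T / np.sqrt(d_k)                 # the scaling happens here, on the scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
out = weights @ V                               # weighted sum of values, [seq_len, d_k]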
there's also multi-head attention - multiple self-attention 'heads' run in parallel and their outputs concatenated. each head can focus on a different kind of relationship between tokens, which lets the model understand the text better.
this is what i understood after reading the complete paper once, even though the second half of the paper got only a bare-minimum skim. i need to read it all again and answer the previous unanswered questions.
the annotated transformer - harvardnlp
should i read this or should i re-read the original paper and try to figure out stuff from that? i'll give the original paper one more shot.
i'll start a new page blog.