𝔞𝔱𝔱𝔢𝔫𝔱𝔦𝔬𝔫 & attention is all you need
12:35 PM 22-03-2025
बन्धुरात्मात्मनस्तस्य येनात्मैवात्मना जित: |
अनात्मनस्तु शत्रुत्वे वर्तेतात्मैव शत्रुवत् || (Bhagavad Gita 6.6)
mind can be a friend or a foe depending on how you sculpt it;
stay focused on your tasks. attention is all you need;
sequence transduction models existed. ( i feel like i'm regurgitating the same thoughts instead of having improved thoughts. maybe there isn't much info in the abstract, or there is? ). these were based on RNNs. they couldn't handle very long sequences because they have to process the sequence one unit at a time instead of allowing parallel computation.
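to see why RNNs can't be parallelized across time, here's a minimal sketch of an RNN forward pass (the weights, sizes, and random inputs are all made up for illustration). each hidden state depends on the previous one, so the loop is strictly sequential:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                              # toy hidden/input size
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1

def rnn_forward(xs):
    h = np.zeros(d)
    states = []
    for x in xs:                   # h_t needs h_{t-1}: no parallelism over time
        h = np.tanh(W_h @ h + W_x @ x)
        states.append(h)
    return np.stack(states)

seq = rng.normal(size=(6, d))      # a toy "sequence" of 6 token vectors
H = rnn_forward(seq)
print(H.shape)  # (6, 4)
```

the point is the for-loop: step t can't start until step t-1 finishes, which is exactly what the transformer gets rid of.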
the transformer is the architecture they came up with to enable this parallelization: attention mechanisms capable of 'attending' to the appropriate parts of the input whenever needed. it even took less time to train and run compared to RNNs.
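the reason attention parallelizes well is that it's just a few matrix products over the whole sequence at once. a minimal sketch of scaled dot-product attention (toy shapes, random values, no learned projections - all made up for illustration):

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of every query to every key
    # softmax over the key axis -> attention weights
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # each output mixes ALL positions at once

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = attention(Q, K, V)
print(out.shape)  # (5, 8) - computed in one shot, no sequential loop
```

compare with the RNN: there's no loop over time here, so every position is processed simultaneously.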
eschew = "to avoid deliberately"
mistral raised almost a billion dollars and that's the output you get. not dunking on mistral. big fan. but the fact is that a billion dollars will still get you to mistral (if you're lucky) and not openai. wait. but deepseek? yes, they had everything set up, ready to just let the engineers loose on AI. but they have a fucking hedge fund company. so, deepseek might actually be more expensive if you zoom out. so there's that. i have no idea why i'm delusional enough to believe that i can start my work now and turn it into something big. maybe it's because the LLM domain is still at an early stage even though it began years ago. there's more to come. i believe it.
they had bytenet and convs2s, but it's difficult for these to learn the relation between 2 entities/tokens as the distance between them increases. this issue can be resolved by using the transformer architecture, in which a fixed, small number of operations is required to relate any 2 entities. but this reduces the 'quality' - the effective resolution drops, because attention averages over many positions at once. this is counteracted using multi-head attention - similar to single attention but with multiple heads working in parallel, each on its own subspace, with their outputs concatenated at the end.
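the split-attend-concatenate idea can be sketched like this (toy dimensions, not the paper's d_model=512 / h=8, and real models use learned projections per head rather than plain slices):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, h=2):
    n, d_model = X.shape
    d_k = d_model // h                        # each head gets a slice of the model dim
    outputs = []
    for i in range(h):
        Q = K = V = X[:, i * d_k:(i + 1) * d_k]   # toy stand-in for learned projections
        scores = Q @ K.T / np.sqrt(d_k)
        outputs.append(softmax(scores) @ V)   # each head attends independently
    return np.concatenate(outputs, axis=-1)   # concat heads -> back to d_model

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
print(multi_head_attention(X).shape)  # (5, 8)
```

each head can specialize on its own subspace, which is what recovers the "quality" lost by averaging.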
""To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. ""
11:49 PM 30-03-2025
remember last week i talked about learning this thing. yeah, i missed it. again. this must've been at least the 1000th time i stopped something midway. anyways, fuck this shit. i have no other option but to follow this path. the problem isn't that i'm unwilling to change my archaic ways. the problem is that i've seen myself following the same archaic ways and succeed. i'm kind of in a delusion believing that i'll somehow figure things out "the right way". alright, let's try again, lol.
let's take a look at model architecture. ok. now that i see it, i remember bookmarking many sites for reference. where are they now? i'm not sure. can't find them now. that's an issue. must maintain some sort of notes. the blog would've worked if i worked. that's too much bearish talk. forget it. just focus on the task.
"neural sequence transduction models" - i don't what it means
it just means a neural network that takes a sequence (of tokens) and 'converts' (transduces) it into an output token sequence.
x1...xn is the input;
z1...zn is the encoder's output - a continuous representation of the input (not something from the training dataset; the encoder computes it)
in a general neural sequence transduction model, we've got an encoder and a decoder. in this case, the encoder maps the input X to output Z.
just consider X, Y, Z to be continuous sequences of symbol representations for now. given Z, the decoder will generate an output sequence Y - y1...ym. so if Y is the output, Z is the intermediate vector representation sitting between encoder and decoder - not the final answer, just what the encoder hands over. also, it's an autoregressive model, implying that the elements generated so far are used as input for generating future outputs. i mean, it's obvious that you have to give the model the first half of the answer if you want it to continue based on that and generate the second half. it can't magically generate the second half before knowing the first half.
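the autoregressive feedback loop can be shown with a toy stand-in "model" (the next-token rule below is completely made up, purely to illustrate how outputs feed back in as inputs):

```python
def fake_model(prefix):
    # hypothetical stand-in for a trained decoder:
    # "predicts" the next token as the length of what's generated so far
    return len(prefix)

def generate(max_len=5):
    ys = []
    for _ in range(max_len):
        ys.append(fake_model(ys))   # everything generated so far is the new input
    return ys

print(generate())  # [0, 1, 2, 3, 4]
```

a real decoder replaces fake_model with a neural network, but the loop structure is the same.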
but wait, just read that thing about generating poems in anthropic's blog. the model apparently 'fires' up the rhyming words and plans accordingly to set the previous words appropriately. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
""Planning in Poems: We discover that the model plans its outputs ahead of time when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words that could appear at the end. These preselected rhyming options then shape how the model constructs the entire line.""
see, i waste time on these. for some unknown stupid reason, i need to make the upper poem text headline but not the even more upper one. the first one is already block size but the same thing can't be replicated below. can't even get it after copy pasting. i have no idea why these 2 are being treated as 2 different cases but shit like this makes me lose my mind. anyways, if it works fine currently, i'll try to be content.
the more i look at the architecture image, the more i feel the need to revisit my basics of neural networks. instead of it vaguely making sense, maybe it'll completely make sense. so, where do we find neural network blogs? karpathy.
https://www.youtube.com/watch?v=VMj-3S1tku0&ab_channel=AndrejKarpathy
10:18 PM 31-03-2025
ok. we're starting the lecture now.
i think i found that stupid issue;
backpropagation - efficiently calculating the gradient of some loss function with respect to the weights in a neural network
a gradient/derivative tells us how fast a function's value is changing at a particular point when the input is shifted by an infinitesimally small amount in the positive direction
consider a function f(x). obviously you can differentiate it using the chain rule and find the derivative that way, but in reality it is just simply

f'(x) = lim (h -> 0) of [ f(x + h) - f(x) ] / h

where 'h' is the infinitesimally small number => h will be as close to 0 as possible; as h tends to 0, the value of this expression is considered the derivative
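this is easy to sanity-check numerically. for a made-up function f(x) = 3x^2 + 2x, the analytic derivative is 6x + 2, and the finite-difference version with a small h should land right on it:

```python
def f(x):
    return 3 * x**2 + 2 * x

def numerical_derivative(f, x, h=1e-6):
    # the limit definition, with h small but finite
    return (f(x + h) - f(x)) / h

x = 2.0
analytic = 6 * x + 2              # d/dx (3x^2 + 2x) = 6x + 2 = 14 at x=2
approx = numerical_derivative(f, x)
print(analytic, round(approx, 3))  # 14.0 14.0
```

backprop computes the same quantity, just analytically and for millions of weights at once instead of one finite difference at a time.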
am i writing all this unnecessarily?