Introduction to Attention

Suppose you give an LLM the input What is the capital of France? The first thing the LLM will do is split this input into tokens. A token is just some combination of characters. You can see an example of the tokenization output for the question below. $\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}{?}$ (This tokenization was produced using cl100k_base, the tokenizer used in GPT-3.5-turbo and GPT-4.) In this example we have $n = 7$ tokens....
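To make this concrete, here is a minimal sketch of that tokenization step using the tiktoken package (assumed to be installed); decoding each ID back to text recovers the same seven chunks highlighted above.

```python
# Minimal tokenization sketch, assuming tiktoken is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5-turbo / GPT-4

text = "What is the capital of France?"
token_ids = enc.encode(text)                    # integer token IDs
tokens = [enc.decode([t]) for t in token_ids]   # the character chunk each ID maps to

print(tokens)       # ['What', ' is', ' the', ' capital', ' of', ' France', '?']
print(len(tokens))  # n = 7
```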

March 30, 2024

Fancy Pants Attention Techniques

Sparse Attention Sparse Attention introduces sparse factorizations of the attention matrix. To implement this we introduce a connectivity pattern $S = \{S_1, \dots, S_n\}$. Here, $S_i$ denotes the set of indices of the input vectors to which the $i$th output vector attends. For instance, in regular $n^2$ (dense, causal) attention every output vector attends to every input vector at or before its position in the sequence, i.e. $S_i = \{j : j \le i\}$. Remember that $d_k$ is the inner dimension of our queries and keys....
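As a rough illustration of how a connectivity pattern turns into an attention computation, here is a small NumPy sketch (not any paper's reference implementation): each output position $i$ only softmaxes over the scores at indices in $S_i$. For clarity it materializes the full $n \times n$ mask, which a real sparse-attention kernel would avoid by exploiting the factorization directly.

```python
# Sketch: masked attention under a connectivity pattern S, where S[i] is the
# set of input indices the i-th output attends to. Builds the dense n x n
# mask only for illustration.
import numpy as np

def sparse_attention(Q, K, V, S):
    """Q, K: (n, d_k); V: (n, d_v); S: list of index sets of length n."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled dot products

    mask = np.full((n, n), -np.inf)
    for i, idx in enumerate(S):
        mask[i, list(idx)] = 0.0         # allowed positions keep their score
    scores = scores + mask

    # Row-wise softmax; disallowed positions (score -inf) get weight 0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: regular causal attention, S_i = {j : j <= i}.
n, d_k, d_v = 7, 16, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
S_causal = [set(range(i + 1)) for i in range(n)]
print(sparse_attention(Q, K, V, S_causal).shape)  # (7, 16)
```

Swapping in a strided or fixed pattern for `S_causal` gives the sparse variants; only the connectivity pattern changes, not the attention computation itself.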

March 23, 2024