Introduction to Attention

Suppose you give an LLM the input What is the capital of France? The first thing the LLM will do is split this input into tokens. A token is just some combination of characters. You can see an example of the tokenization output for the question below. $\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}{?}$ (This tokenization was produced using cl100k_base, the tokenizer used in GPT-3.5-turbo and GPT-4.) In this example we have $n = 7$ tokens....
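To make this concrete, here is a minimal sketch of that tokenization step using the tiktoken package (assumed to be installed); decoding each ID back to text recovers the same seven chunks highlighted above.

```python
# Minimal tokenization sketch, assuming tiktoken is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5-turbo / GPT-4

text = "What is the capital of France?"
token_ids = enc.encode(text)                    # integer token IDs
tokens = [enc.decode([t]) for t in token_ids]   # the character chunk each ID maps to

print(tokens)       # ['What', ' is', ' the', ' capital', ' of', ' France', '?']
print(len(tokens))  # n = 7
```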

March 30, 2024

Fancy Pants Attention Techniques

Sparse Attention Sparse Attention introduces sparse factorizations of the attention matrix. To implement this we introduce a connectivity pattern $S = \{S_1, \dots, S_n\}$. Here, $S_i$ denotes the set of indices of the input vectors to which the $i$th output vector attends. For instance, in regular $n^2$ (dense, causal) attention every output vector attends to every input vector at or before its position in the sequence, i.e. $S_i = \{j : j \le i\}$. Remember that $d_k$ is the inner dimension of our queries and keys....
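As a rough illustration of how a connectivity pattern turns into an attention computation, here is a small NumPy sketch (not any paper's reference implementation): each output position $i$ only softmaxes over the scores at indices in $S_i$. For clarity it materializes the full $n \times n$ mask, which a real sparse-attention kernel would avoid by exploiting the factorization directly.

```python
# Sketch: masked attention under a connectivity pattern S, where S[i] is the
# set of input indices the i-th output attends to. Builds the dense n x n
# mask only for illustration.
import numpy as np

def sparse_attention(Q, K, V, S):
    """Q, K: (n, d_k); V: (n, d_v); S: list of index sets of length n."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled dot products

    mask = np.full((n, n), -np.inf)
    for i, idx in enumerate(S):
        mask[i, list(idx)] = 0.0         # allowed positions keep their score
    scores = scores + mask

    # Row-wise softmax; disallowed positions (score -inf) get weight 0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: regular causal attention, S_i = {j : j <= i}.
n, d_k, d_v = 7, 16, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
S_causal = [set(range(i + 1)) for i in range(n)]
print(sparse_attention(Q, K, V, S_causal).shape)  # (7, 16)
```

Swapping in a strided or fixed pattern for `S_causal` gives the sparse variants; only the connectivity pattern changes, not the attention computation itself.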

March 23, 2024