YouTube: Let's build the GPT Tokenizer
LLMs aren’t built directly on words, Chinese characters, or any natural-language alphabet; they are built on tokens!

Tokens

What are tokens?
Imagine there is a big, fixed “vocabulary”; each token is a “word” in that vocabulary. The catch is that this vocabulary may not be directly human-readable. For example, the UTF-8 encoding of the string "안" is [236, 149, 136]. These can be three tokens if we use the byte values 0-255 as the whole vocabulary.
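A quick way to see this in plain Python (just the standard library, nothing model-specific):

```python
# Inspect the raw UTF-8 bytes behind a string; each byte value (0-255) can serve as a token id.
text = "안"
tokens = list(text.encode("utf-8"))
print(tokens)  # [236, 149, 136] -> three tokens under a byte-level vocabulary
```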
How many tokens do we want?
Remember that an LLM first maps each token to an embedding and, at the very end, maps the result back to a probability over every token. These two maps are trainable, and each contains vocab_size * embed_size parameters. So the size of the vocabulary has a direct impact on the total model size, and this limits the vocabulary size (the total number of tokens).
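As a rough sketch of how the vocabulary size enters the parameter count, here is a PyTorch example (the sizes are illustrative, not taken from any particular model):

```python
import torch.nn as nn

vocab_size, embed_size = 100_000, 768
tok_emb = nn.Embedding(vocab_size, embed_size)           # map: token id -> embedding
lm_head = nn.Linear(embed_size, vocab_size, bias=False)  # map: hidden state -> logits over all tokens

print(sum(p.numel() for p in tok_emb.parameters()))  # 76800000 = vocab_size * embed_size
print(sum(p.numel() for p in lm_head.parameters()))  # 76800000 again
```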
Also, during training and inference, each token can only look at a limited context window (an array of tokens). Therefore, we want each token to be as expressive as possible, i.e., to cover as much text as possible. That encourages having more tokens.
  • For example, imagine tokenizing a Python snippet. For a four-space indent, we want one token that represents all four spaces, instead of four identical tokens that each represent a single space. The second way makes the token sequence longer and fills the restricted context window with repeated information.
    • This encourages having more tokens. (We need one token to represent a single space, and we want a second one to represent four consecutive spaces. It would be even better to have a third one that represents eight consecutive spaces!)
    🗒️
    Therefore, there is a trade-off between the size of the vocabulary and its representation ability.
    In practice, the vocabulary size is chosen empirically, commonly around 100K.
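To see this effect with a real tokenizer, the tiktoken library can be used to encode runs of whitespace; with the cl100k_base encoding, common indentation widths tend to collapse into very few tokens (the exact counts and ids depend on the encoding, so this is just a probe, not a guaranteed result):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for n in (1, 4, 8):
    spaces = " " * n
    print(n, len(enc.encode(spaces)))  # runs of spaces typically collapse into a single token
```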
     

    How to translate human-readable strings into tokens?

    Tokenizer! A tokenizer translates back and forth between raw text and sequences of tokens.
    The website https://tiktokenizer.vercel.app/ lets you play with popular models’ tokenizers.
    • A tokenizer can be built by manual labeling. For example, label “english” as 1, “English” as 2, “en” as 3, “En” as 4, etc. Basically, this requires exhaustively listing all the words in all the languages in all their forms.
      • Risk: if any form of any word is missing, the LLM will fail to “understand” it.
    • It can also be automatic. One example is the BPE algorithm (BPE stands for byte pair encoding); a minimal sketch follows this list.
      • The general idea of BPE is to count how many times each pair of adjacent tokens shows up in the dataset, replace the most frequent pair with a new token, and keep going until the vocabulary reaches a good size.
      • For example, strings encoded in UTF-8 start from the byte vocabulary 0-255. "hiiii dog".encode("utf-8") gives the tokens [104, 105, 105, 105, 105, 32, 100, 111, 103].
        • Here the most frequent pair is (105, 105). We add a new token 256 to replace (105, 105), the vocabulary becomes 0-256, and the tokens become [104, 256, 256, 32, 100, 111, 103].
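Below is a minimal sketch of the BPE training loop described above, in plain Python (the function names are just for illustration):

```python
from collections import Counter

def get_pair_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("hiiii dog".encode("utf-8"))  # [104, 105, 105, 105, 105, 32, 100, 111, 103]
merges = {}                                 # (left, right) -> new token id
num_merges = 1                              # in practice, keep merging until the vocabulary is big enough

for k in range(num_merges):
    counts = get_pair_counts(tokens)
    if not counts:
        break
    pair = max(counts, key=counts.get)  # most frequent adjacent pair, here (105, 105)
    new_id = 256 + k                    # next unused token id
    merges[pair] = new_id
    tokens = merge(tokens, pair, new_id)

print(tokens)  # [104, 256, 256, 32, 100, 111, 103]
```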
     

    How to introduce new tokens to a trained model?

    As said before, introducing new tokens only affects the two token-level maps: the input embedding and the final projection from the hidden state to token probabilities. So it is doable to extend the input embedding and the output linear module, freeze all the existing parameters, and train only the parameters for the new tokens (a sketch follows below).
    • An application: prompt compression. Suppose you have a very long prompt; you can introduce a few new tokens to stand in for the long prompt’s tokens and fine-tune the model only on those new tokens.
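A hedged PyTorch sketch of extending the two token-level maps, assuming the trained model exposes an input embedding and an output projection (the names and sizes below are made up for illustration):

```python
import torch
import torch.nn as nn

old_vocab, new_vocab, embed_size = 100_000, 100_016, 768  # 16 hypothetical new tokens

# Stand-ins for the trained model's input embedding and output projection
# (in practice these would already hold trained weights).
tok_emb = nn.Embedding(old_vocab, embed_size)
lm_head = nn.Linear(embed_size, old_vocab, bias=False)

# Extend both maps, copying over the trained weights for the existing tokens.
new_tok_emb = nn.Embedding(new_vocab, embed_size)
new_lm_head = nn.Linear(embed_size, new_vocab, bias=False)
with torch.no_grad():
    new_tok_emb.weight[:old_vocab] = tok_emb.weight
    new_lm_head.weight[:old_vocab] = lm_head.weight

# One simple way to train only the new rows: freeze everything else in the model,
# then zero out the gradients flowing into the old rows of these two layers.
def zero_old_rows(grad):
    grad = grad.clone()
    grad[:old_vocab] = 0  # keep the already-trained rows untouched
    return grad

new_tok_emb.weight.register_hook(zero_old_rows)
new_lm_head.weight.register_hook(zero_old_rows)
```

With this setup, only the rows for the new tokens receive non-zero gradients during fine-tuning, so the model's behavior on the original vocabulary is preserved.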