YouTube: Let's build the GPT Tokenizer
LLMs aren’t built directly on words, Chinese characters, or any natural-language alphabet; they are built on tokens!

Tokens

What are tokens?
Imagine there is a big, fixed “vocabulary”; each token is a “word” in that vocabulary. The catch is that this vocabulary may not be directly human-readable. For example, the UTF-8 encoding of the string "안" is [236, 149, 136]. These can be three tokens if we use the byte values 0-255 as the whole vocabulary.
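A quick way to see this in plain Python (just the standard library, nothing model-specific):

```python
# Inspect the raw UTF-8 bytes behind a string; each byte value (0-255) can serve as a token id.
text = "안"
tokens = list(text.encode("utf-8"))
print(tokens)  # [236, 149, 136] -> three tokens under a byte-level vocabulary
```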
How many tokens do we want?
Remember that an LLM first maps each token to an embedding and, at the very end, maps the result back to a probability over every token. These two maps are trainable, and each contains vocab_size * embed_size parameters. So the size of the vocabulary has a direct impact on the total model size, and this limits the vocabulary size (the total number of tokens).
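As a rough sketch of how the vocabulary size enters the parameter count, here is a PyTorch example (the sizes are illustrative, not taken from any particular model):

```python
import torch.nn as nn

vocab_size, embed_size = 100_000, 768
tok_emb = nn.Embedding(vocab_size, embed_size)           # map: token id -> embedding
lm_head = nn.Linear(embed_size, vocab_size, bias=False)  # map: hidden state -> logits over all tokens

print(sum(p.numel() for p in tok_emb.parameters()))  # 76800000 = vocab_size * embed_size
print(sum(p.numel() for p in lm_head.parameters()))  # 76800000 again
```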
Also, during training and inference, each token can only look at a limited context window (an array of tokens). Therefore, we want each token to be as expressive as possible, i.e., to cover as much text as possible. That encourages having more tokens.
  • For example, imagine tokenizing a Python snippet. For a four-space indent, we want one token that represents all four spaces, instead of four identical tokens that each represent a single space. The second way makes the token sequence longer and fills the restricted context window with repeated information.
    • This encourages having more tokens. (We need one token to represent a single space, and we want a second one to represent four consecutive spaces. It would be even better to have a third one that represents eight consecutive spaces!)
    🗒️
    Therefore, there is a trade-off between the size of the vocabulary and its representation ability.
    In practice, the vocabulary size is chosen empirically, commonly around 100K.
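To see this effect with a real tokenizer, the tiktoken library can be used to encode runs of whitespace; with the cl100k_base encoding, common indentation widths tend to collapse into very few tokens (the exact counts and ids depend on the encoding, so this is just a probe, not a guaranteed result):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for n in (1, 4, 8):
    spaces = " " * n
    print(n, len(enc.encode(spaces)))  # runs of spaces typically collapse into a single token
```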
     

    How to translate human-readable strings into tokens?

    Tokenizer! A tokenizer translates back and forth between raw text and sequences of tokens.
    The website https://tiktokenizer.vercel.app/ lets you play with popular models’ tokenizers.
    • A tokenizer can be built by manual labeling. For example, label “english” as 1, “English” as 2, “en” as 3, “En” as 4, etc. Basically, this requires exhaustively listing all the words in all the languages in all their forms.
      • Risk: if any form of any word is missing, the LLM will fail to “understand” it.
    • It can also be automatic. One example is the BPE algorithm (BPE stands for byte pair encoding); a minimal sketch follows this list.
      • The general idea of BPE is to count how many times each pair of adjacent tokens shows up in the dataset, replace the most frequent pair with a new token, and keep going until the vocabulary reaches a good size.
      • For example, strings encoded in UTF-8 start from the byte vocabulary 0-255. "hiiii dog".encode("utf-8") gives the tokens [104, 105, 105, 105, 105, 32, 100, 111, 103].
        • Here the most frequent pair is (105, 105). We add a new token 256 to replace (105, 105), the vocabulary becomes 0-256, and the tokens become [104, 256, 256, 32, 100, 111, 103].
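Below is a minimal sketch of the BPE training loop described above, in plain Python (the function names are just for illustration):

```python
from collections import Counter

def get_pair_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("hiiii dog".encode("utf-8"))  # [104, 105, 105, 105, 105, 32, 100, 111, 103]
merges = {}                                 # (left, right) -> new token id
num_merges = 1                              # in practice, keep merging until the vocabulary is big enough

for k in range(num_merges):
    counts = get_pair_counts(tokens)
    if not counts:
        break
    pair = max(counts, key=counts.get)  # most frequent adjacent pair, here (105, 105)
    new_id = 256 + k                    # next unused token id
    merges[pair] = new_id
    tokens = merge(tokens, pair, new_id)

print(tokens)  # [104, 256, 256, 32, 100, 111, 103]
```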
     

    How to introduce new tokens to a trained model?

    As said before, introducing new tokens only affects the two token-level maps: the input embedding and the final projection from the hidden state to token probabilities. So it is doable to extend the input embedding and the output linear module, freeze all the existing parameters, and train only the parameters for the new tokens (a sketch follows below).
    • An application: prompt compression. Suppose you have a very long prompt; you can introduce a few new tokens to stand in for the long prompt’s tokens and fine-tune the model only on those new tokens.
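A hedged PyTorch sketch of extending the two token-level maps, assuming the trained model exposes an input embedding and an output projection (the names and sizes below are made up for illustration):

```python
import torch
import torch.nn as nn

old_vocab, new_vocab, embed_size = 100_000, 100_016, 768  # 16 hypothetical new tokens

# Stand-ins for the trained model's input embedding and output projection
# (in practice these would already hold trained weights).
tok_emb = nn.Embedding(old_vocab, embed_size)
lm_head = nn.Linear(embed_size, old_vocab, bias=False)

# Extend both maps, copying over the trained weights for the existing tokens.
new_tok_emb = nn.Embedding(new_vocab, embed_size)
new_lm_head = nn.Linear(embed_size, new_vocab, bias=False)
with torch.no_grad():
    new_tok_emb.weight[:old_vocab] = tok_emb.weight
    new_lm_head.weight[:old_vocab] = lm_head.weight

# One simple way to train only the new rows: freeze everything else in the model,
# then zero out the gradients flowing into the old rows of these two layers.
def zero_old_rows(grad):
    grad = grad.clone()
    grad[:old_vocab] = 0  # keep the already-trained rows untouched
    return grad

new_tok_emb.weight.register_hook(zero_old_rows)
new_lm_head.weight.register_hook(zero_old_rows)
```

With this setup, only the rows for the new tokens receive non-zero gradients during fine-tuning, so the model's behavior on the original vocabulary is preserved.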