Byte-Pair Encoding Merge Tables (tokenization)

Byte-Pair Encoding (BPE) is the core algorithm used to train LLM tokenizers (like Tiktoken or SentencePiece). It builds a vocabulary by iteratively merging the most frequent pairs of characters or bytes.

The Algorithm Steps

Initialize Vocabulary: Start with all individual bytes (values 0-255) as base tokens. The vocabulary size is initially 256.
Tokenize Corpus: Convert the training text corpus into a list of byte integers.
Count Adjacent Pairs: Find all adjacent pairs of token IDs and count their frequencies in the corpus.
Merge Most Frequent Pair: Identify the most frequent pair (t_i, t_j). Create a new token ID t_new and add it to the vocabulary. Replace all occurrences of (t_i, t_j) in the tokenized corpus with t_new.
Record Merge: Add the merge operation (t_i, t_j) -> t_new to the BPE merge table.
Repeat: Go back to step 3 and repeat until the vocabulary reaches the target size (e.g. 50,000 or 100,000).

Python Training Reference

Here is the core logic for training a BPE tokenizer:

def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
 
def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
 
# Example training loop
def train_bpe(raw_bytes, vocab_size):
    num_merges = vocab_size - 256
    ids = list(raw_bytes)
    merges = {} # (int, int) -> int
    
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        # Find the pair with highest count
        best_pair = max(stats, key=stats.get)
        new_id = 256 + i
        ids = merge(ids, best_pair, new_id)
        merges[best_pair] = new_id
        
    return merges

The Algorithm Steps

Python Training Reference

Share Reference Sheet