TiktokenEncoder

open class TiktokenEncoder(vocabulary: Map<ByteArrayKey, Int>, pattern: Regex, unkTokenId: Int) : Tokenizer

A tokenizer that converts text into a series of token IDs using a token encoding vocabulary and a regex pattern. Segments of text that are not found directly in the vocabulary are encoded with byte pair encoding (BPE).

This implementation is based on the reference Rust implementation from https://github.com/openai/tiktoken/blob/main/src/lib.rs.
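The BPE step mentioned above can be sketched roughly as follows. This is a simplified illustration, not the library's code: the names bpeEncode and bytesOf are hypothetical, the vocabulary is keyed by List<Byte> instead of ByteArrayKey to keep the block self-contained, and the naive quadratic loop stands in for whatever the actual implementation does. It assumes that lower token IDs are merged first, as ranks are in the reference tiktoken implementation.

```kotlin
// Hypothetical helper: converts a string to a content-comparable byte list.
fun bytesOf(s: String): List<Byte> = s.encodeToByteArray().toList()

// Hypothetical sketch of encoding one segment with byte pair merging.
// `ranks` plays the role of the vocabulary: byte sequence -> token ID.
fun bpeEncode(segment: ByteArray, ranks: Map<List<Byte>, Int>, unkTokenId: Int): List<Int> {
    // If the whole segment is in the vocabulary, it is a single token.
    ranks[segment.toList()]?.let { return listOf(it) }

    // Otherwise start from single bytes and merge adjacent pairs.
    val parts: MutableList<List<Byte>> = segment.map { listOf(it) }.toMutableList()

    while (parts.size > 1) {
        // Find the adjacent pair whose concatenation has the lowest rank.
        var bestIndex = -1
        var bestRank = Int.MAX_VALUE
        for (i in 0 until parts.size - 1) {
            val rank = ranks[parts[i] + parts[i + 1]] ?: continue
            if (rank < bestRank) {
                bestRank = rank
                bestIndex = i
            }
        }
        // No mergeable pair is left in the vocabulary.
        if (bestIndex == -1) break

        // Merge the best pair into one part.
        parts[bestIndex] = parts[bestIndex] + parts[bestIndex + 1]
        parts.removeAt(bestIndex + 1)
    }

    // Parts that are still missing from the vocabulary map to the unknown token ID.
    return parts.map { ranks[it] ?: unkTokenId }
}
```

With a toy vocabulary of a=0, b=1, c=2, ab=3, bc=4 and an unknown token ID of -1, encoding "abc" with this sketch merges "a" and "b" first (rank 3 beats rank 4) and yields [3, 2]; encoding "abd" yields [3, -1].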

Parameters

vocabulary

A mapping of ByteArrayKey to token ID that represents the token encoding vocabulary.
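A wrapper type is presumably used as the map key because a plain Kotlin ByteArray compares by reference, so two arrays with the same bytes are different keys. The definition of ByteArrayKey is not shown on this page; the BytesKey class below is a hypothetical stand-in that only illustrates the idea of content-based equality.

```kotlin
// Hypothetical stand-in for a byte-array map key; not the library's ByteArrayKey.
class BytesKey(private val bytes: ByteArray) {
    // Compare by content instead of by reference.
    override fun equals(other: Any?): Boolean =
        other is BytesKey && bytes.contentEquals(other.bytes)

    // Hash by content so that equal keys land in the same bucket.
    override fun hashCode(): Int = bytes.contentHashCode()
}

// Looks up a token ID for the UTF-8 bytes of `text` in a toy vocabulary.
fun lookup(vocabulary: Map<BytesKey, Int>, text: String): Int? =
    vocabulary[BytesKey(text.encodeToByteArray())]
```

A map keyed by raw ByteArray would return null for a freshly created array with the same contents, whereas the wrapped key finds the entry.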

pattern

A regular expression used to split the text into segments for initial token matching against the vocabulary.
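As a rough illustration of how a pattern can carve text into segments, the sketch below uses Kotlin's Regex.findAll. The pattern simplePattern and the helper splitSegments are hypothetical and far simpler than the patterns shipped with real tiktoken encodings; they only show the mechanism.

```kotlin
// Hypothetical, simplified pattern: runs of letters, runs of digits,
// runs of other non-space symbols, or runs of whitespace.
val simplePattern = Regex("""\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+""")

// Hypothetical helper: returns every match of `pattern` in order of appearance.
fun splitSegments(text: String, pattern: Regex): List<String> =
    pattern.findAll(text).map { it.value }.toList()
```

With this pattern, splitSegments("Hello, world 42", simplePattern) returns the segments "Hello", ",", " ", "world", " " and "42".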

unkTokenId

The token ID used for unknown tokens not present in the vocabulary.

Constructors

TiktokenEncoder

constructor(vocabulary: Map<ByteArrayKey, Int>, pattern: Regex, unkTokenId: Int)

Initializes the TiktokenEncoder with the specified vocabulary, regex pattern for matching tokens, and a token ID for unknown tokens.

Functions

countTokens

open override fun countTokens(text: String): Int

Counts the number of tokens in the given text.
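One way a token count might be used is to keep text within a token budget. The sketch below is an assumption about usage, not part of the library: takeWithinBudget is a hypothetical helper, and it accepts the counting function as a parameter so that it does not depend on how an encoder is constructed. In real code a reference such as encoder::countTokens could be passed in.

```kotlin
// Hypothetical helper: keeps leading lines while their total token count
// stays within `maxTokens`, and stops at the first line that would exceed it.
fun takeWithinBudget(
    lines: List<String>,
    maxTokens: Int,
    countTokens: (String) -> Int,
): List<String> {
    val kept = mutableListOf<String>()
    var used = 0
    for (line in lines) {
        val cost = countTokens(line)
        if (used + cost > maxTokens) break
        kept += line
        used += cost
    }
    return kept
}

// Stand-in counter for demonstration only: counts space-separated words, not real tokens.
val wordCount: (String) -> Int = { it.split(" ").size }
```

With the stand-in counter and a budget of 5, the lines "a b", "c d e" and "f" are cut down to the first two.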