TiktokenEncoder
open class TiktokenEncoder(vocabulary: Map<ByteArrayKey, Int>, pattern: Regex, unkTokenId: Int) : Tokenizer
A tokenizer implementation that uses a token encoding vocabulary and a regex pattern to tokenize text into a sequence of token IDs. Segments of text not found directly in the vocabulary are encoded with byte pair encoding (BPE).
This implementation is based on the reference Rust implementation from https://github.com/openai/tiktoken/blob/main/src/lib.rs.
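To make the process concrete, the following is a minimal sketch of the BPE merge step described above. The ByteArrayKey wrapper and the greedy lowest-rank merge loop are illustrative assumptions modeled on the tiktoken reference implementation, not the library's actual code.

```kotlin
// Assumed minimal ByteArrayKey wrapper with content-based equality;
// the real class in the library may differ.
class ByteArrayKey(val bytes: ByteArray) {
    override fun equals(other: Any?): Boolean =
        other is ByteArrayKey && bytes.contentEquals(other.bytes)
    override fun hashCode(): Int = bytes.contentHashCode()
}

// Encodes one regex-matched segment into token IDs by repeatedly merging the
// adjacent pair of parts whose concatenation has the lowest rank in the
// vocabulary, in the spirit of tiktoken's Rust reference implementation.
fun bytePairEncode(
    segment: ByteArray,
    vocabulary: Map<ByteArrayKey, Int>,
    unkTokenId: Int
): List<Int> {
    // If the whole segment is in the vocabulary, it maps to a single token.
    vocabulary[ByteArrayKey(segment)]?.let { return listOf(it) }

    // Otherwise start from individual bytes and merge greedily by rank.
    val parts: MutableList<ByteArray> = segment.map { byteArrayOf(it) }.toMutableList()
    while (parts.size > 1) {
        var bestIndex = -1
        var bestRank = Int.MAX_VALUE
        for (i in 0 until parts.size - 1) {
            val rank = vocabulary[ByteArrayKey(parts[i] + parts[i + 1])] ?: continue
            if (rank < bestRank) {
                bestRank = rank
                bestIndex = i
            }
        }
        if (bestIndex == -1) break  // no adjacent pair can be merged further
        parts[bestIndex] = parts[bestIndex] + parts[bestIndex + 1]
        parts.removeAt(bestIndex + 1)
    }

    // Map each remaining part to its token ID, falling back to unkTokenId.
    return parts.map { vocabulary[ByteArrayKey(it)] ?: unkTokenId }
}
```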
Parameters
vocabulary
A mapping of ByteArrayKey to token ID that represents the token encoding vocabulary.
pattern
A regular expression used to split the input text into segments for the initial token matching step.
unkTokenId
The token ID assigned to byte sequences that are not present in the vocabulary.
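A hedged construction example based only on the constructor signature documented above; the vocabulary entries, the split pattern, and the ByteArrayKey constructor used here are placeholder assumptions rather than the real tiktoken data.

```kotlin
// Toy vocabulary and split pattern, for illustration only.
val vocabulary: Map<ByteArrayKey, Int> = mapOf(
    ByteArrayKey("hello".toByteArray()) to 0,
    ByteArrayKey("world".toByteArray()) to 1,
)
val pattern = Regex("""\S+|\s+""")  // simplistic stand-in for the real split regex

val encoder = TiktokenEncoder(vocabulary, pattern, unkTokenId = 2)
```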