TiktokenEncoder

constructor(vocabulary: Map<ByteArrayKey, Int>, pattern: Regex, unkTokenId: Int)

Initializes the TiktokenEncoder with the specified vocabulary, regex pattern for matching tokens, and a token ID for unknown tokens.

Parameters

vocabulary

A mapping of ByteArrayKey to token ID that represents the token encoding vocabulary.

pattern

A regular expression applied to the input text to extract the initial pieces that are then matched against the vocabulary.

unkTokenId

The token ID used for unknown tokens not present in the vocabulary.
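
Example

The following is a minimal, self-contained sketch of how the three constructor parameters might cooperate. It does not use the library's classes: SketchByteArrayKey and SketchEncoder are hypothetical stand-ins written for illustration, and the encode function shown here is an assumption, not the documented API. The sketch only does whole-piece and single-byte lookups and omits the byte-pair merging a real tiktoken-style encoder would perform.

```kotlin
// Hypothetical stand-in for ByteArrayKey: wraps a ByteArray so that equality and
// hashing are based on content. A raw ByteArray compares by reference, which
// makes it unusable as a Map key.
class SketchByteArrayKey(private val bytes: ByteArray) {
    override fun equals(other: Any?): Boolean =
        other is SketchByteArrayKey && bytes.contentEquals(other.bytes)

    override fun hashCode(): Int = bytes.contentHashCode()
}

// Hypothetical, simplified encoder that mirrors the documented constructor shape.
class SketchEncoder(
    private val vocabulary: Map<SketchByteArrayKey, Int>,
    private val pattern: Regex,
    private val unkTokenId: Int
) {
    fun encode(text: String): List<Int> {
        val ids = mutableListOf<Int>()
        // The pattern cuts the text into initial pieces.
        for (match in pattern.findAll(text)) {
            val bytes = match.value.encodeToByteArray()
            val whole = vocabulary[SketchByteArrayKey(bytes)]
            if (whole != null) {
                // The whole piece is in the vocabulary: emit one token ID.
                ids.add(whole)
            } else {
                // Otherwise look up each byte, falling back to unkTokenId.
                for (b in bytes) {
                    ids.add(vocabulary[SketchByteArrayKey(byteArrayOf(b))] ?: unkTokenId)
                }
            }
        }
        return ids
    }
}

// Helper (also hypothetical) to build keys from strings.
fun key(s: String) = SketchByteArrayKey(s.encodeToByteArray())

fun main() {
    val vocabulary = mapOf(key("hello") to 0, key(" world") to 1, key("!") to 2)
    val pattern = Regex("""\s?\w+|[^\s\w]+""")
    val encoder = SketchEncoder(vocabulary, pattern, unkTokenId = 99)
    println(encoder.encode("hello world!?")) // prints [0, 1, 2, 99]
}
```

In this sketch the pattern yields the pieces "hello", " world", and "!?". The first two are found whole; "!?" is not, so it is looked up byte by byte, and "?" maps to the unknown token ID 99.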