SimpleRegexBasedTokenizer

class SimpleRegexBasedTokenizer : Tokenizer

A simple regex-based tokenizer that splits text on whitespace and common punctuation.

This tokenizer gives a reasonable approximation of token counts for most LLMs, though it is less accurate than model-specific tokenizers. It is efficient and requires no external dependencies.

Note: Ollama does not provide token counts in responses, so this client-side estimation is necessary for token counting.

Constructors

SimpleRegexBasedTokenizer

constructor()

Functions

countTokens

open override fun countTokens(text: String): Int

Counts tokens by splitting on whitespace and common punctuation.
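
The splitting approach described above can be illustrated with a minimal, standalone sketch. It is not the library's implementation: the function name approximateTokenCount and the exact delimiter set are assumptions chosen for illustration, and the real class may use a different pattern or apply an extra correction to the count.

```kotlin
// Hypothetical delimiter set (an assumption, not taken from the library):
// runs of whitespace and common punctuation are treated as separators.
private val delimiters = Regex("""[\s.,;:!?()\[\]{}"']+""")

// Illustrative helper: counts the non-empty pieces left after splitting
// the text on the delimiter pattern above.
fun approximateTokenCount(text: String): Int =
    text.split(delimiters).count { it.isNotEmpty() }
```

With this sketch, approximateTokenCount("Hello, world! How are you?") returns 5, and an empty or whitespace-only string returns 0. With the library on the classpath, the corresponding call would be SimpleRegexBasedTokenizer().countTokens(text), whose result may differ from the sketch.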