TokSuite

community

AI & ML interests

Tokenization, Robustness, LLMs

Recent Activity

TokSuite Logo

TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across English, Chinese, Turkish, Italian, and Farsi languages, as well as STEM and mathematical text. It includes fourteen models that share the same architecture, training data, training budget, and initialization but differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under real-world perturbations that affect tokenization.

Our code is available at https://github.com/r-three/Tokenizers.