arxiv:2601.14051

Kakugo: Distillation of Low-Resource Languages into Small Language Models

Published on Jan 20, 2026

Abstract

Kakugo is a cost-effective pipeline for training small language models in low-resource languages using synthetic data generated by large teacher models, achieving improved performance across multiple NLP tasks.

AI-generated summary

We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.
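The summary describes the pipeline only at a high level. Below is a minimal, hypothetical sketch of what such a teacher-driven data-generation loop could look like; it is not the paper's implementation. The `teacher_generate` stub, the prompt wording, the example language, and the output filename are all assumptions made for illustration.

```python
# Illustrative sketch of a Kakugo-style synthetic-data pipeline (not from the paper).
# `teacher_generate` is a hypothetical placeholder for a call to a large teacher model
# via whatever chat/completion API you have access to.

import json
from typing import Callable


def teacher_generate(prompt: str) -> str:
    """Placeholder: send `prompt` to a large teacher model and return its reply."""
    raise NotImplementedError("Connect this to your teacher model of choice.")


def make_synthetic_pairs(language: str, n_prompts: int,
                         generate: Callable[[str], str]) -> list[dict]:
    """Ask the teacher to invent instructions in `language`, then answer them."""
    pairs = []
    for _ in range(n_prompts):
        instruction = generate(
            f"Write one short, natural instruction or question in {language}. "
            "Return only the instruction text."
        )
        response = generate(
            f"Respond to the following instruction in {language}:\n\n{instruction}"
        )
        pairs.append({"instruction": instruction, "response": response})
    return pairs


def translate_dataset(language: str, seed_pairs: list[dict],
                      generate: Callable[[str], str]) -> list[dict]:
    """Translate an existing (e.g. English) instruction dataset into `language`."""
    return [
        {
            "instruction": generate(f"Translate into {language}:\n\n{p['instruction']}"),
            "response": generate(f"Translate into {language}:\n\n{p['response']}"),
        }
        for p in seed_pairs
    ]


if __name__ == "__main__":
    language = "Tigrinya"  # hypothetical target; the paper covers 54 languages
    data = make_synthetic_pairs(language, n_prompts=1000, generate=teacher_generate)
    data += translate_dataset(language, seed_pairs=[], generate=teacher_generate)
    with open(f"{language.lower()}_sft.jsonl", "w", encoding="utf-8") as f:
        for row in data:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    # The resulting JSONL can then feed any standard supervised fine-tuning setup
    # for a small base model.
```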


Models citing this paper: 57
Datasets citing this paper: 55
