The State of Uzbek Data for AI

A Comprehensive Overview of Open-Source Uzbek Datasets and Their Roles in LLM Development

1. Introduction

As artificial intelligence continues to reshape industries worldwide, language remains the foundation of intelligent systems. For Uzbekistan, the future of AI depends on the ability of models to understand, reason, and communicate in Uzbek—a rich language with a rapidly digitalizing presence but still limited representation in machine learning datasets.

While English, Chinese, and other high-resource languages benefit from vast open corpora, Uzbek data remains fragmented and under-structured. This article provides a detailed exploration of available Uzbek datasets, classifies them by type, and explains how each class contributes to building modern large language models (LLMs) and NLP models. It also presents an exploratory data analysis (EDA) of open-source Uzbek data to identify what exists today and what is still missing for true Uzbek AI capability.

2. Why Uzbek Data Matters

Training a large language model is not only about collecting words; it is about teaching language comprehension, task understanding, and reasoning. Each category of dataset serves a distinct function in that process:

Raw text corpora teach grammar, structure, and contextual flow.
Instruction datasets teach models to follow human directions and respond appropriately.
Chain-of-Thought (CoT) data develop step-by-step reasoning and explainable logic.
Annotated datasets train and evaluate specific skills such as classification or information extraction.
Evaluation datasets are used to measure performance across language understanding, reasoning, and instruction-following tasks. They provide standardized benchmarks for comparing LLMs.

Understanding how much data exists in each category helps organizations and researchers design the next generation of Uzbek-capable AI systems.

3. Classes of Uzbek Data

Based on an examination of open repositories such as Hugging Face, Mendeley Data, and academic corpora, Uzbek data can be grouped into five principal classes:

Class	Description	Representative Datasets	Primary Use in Model Training
Raw Text Corpora	Unstructured collections of text (books, news, and web pages).	Tahrirchi/uz-crawl, murodbek/uz-books	Pretraining: builds fundamental Uzbek grammar, syntax, and style.
Instruction Datasets	Prompt-response examples showing how to follow user instructions.	UAzimov/uzbek-instruct-llm, Behbudiy/alpaca-cleaned-uz, Behbudiy/translation-instruction	Supervised fine-tuning: teaches task execution and conversational behavior.
Chain-of-Thought (CoT) Data	Multi-step reasoning traces showing how conclusions are reached.	(No publicly available Uzbek dataset yet)	Reasoning fine-tuning: trains logical, interpretable problem solving.
Annotated / Task-Specific Data	Labeled datasets for NLP tasks (e.g., sentiment, NER, semantic similarity).	Behbudiy/uzbek-sentiment-analysis, NER for Uzbek (Mendeley)	Evaluation and specialization: supports classification, labeling, or entity recognition.
Evaluation Datasets	Benchmarks used to measure model performance in language understanding, reasoning, and instruction following.	Tahrirchi/uzlib, SimRelUz	Performance assessment: evaluates comprehension, factual accuracy, and fluency across diverse Uzbek language tasks.

4. EDA of Open-Source Uzbek Datasets

In linguistic coverage, Uzbek data is sufficient for language modeling: large corpora like uz-crawl and uz-books ensure grammatical and lexical diversity. However, in reasoning and structured task learning, the resources are minimal. Models trained purely on existing data can speak Uzbek but cannot reason in Uzbek, because they lack exposure to instructional patterns, logic sequences, and structured decision-making examples.

Dataset Name	Estimated Total Volume / Rows	Description & Limitation	How it was Collected / Source
murodbek / uz-books	≈ 49.93 GB text (≈ 40 000 books)	Large literary & academic corpus in both Latin and Cyrillic Uzbek. Excellent for pre-training language fluency and style.	Digitized public-domain Uzbek books from online libraries; cleaned and de-duplicated to plain text. (Hugging Face)
tahrirchi / uz-crawl	≈ 3.41 GB (raw) → ≈ 1.7 GB processed / ≈ 30 M sentences	Web-crawled Uzbek text (news, blogs, forums). Broad topical diversity but some noise and spelling variation.	Automatically crawled and language-filtered; HTML tags removed and duplicates dropped. (Hugging Face)
UAzimov / uzbek-instruct-llm	≈ 12.6 MB / ≈ 15 000 prompt–response pairs	General instruction dataset (Q&A / task-following). Good for initial supervised fine-tuning; lacks domain-specific reasoning.	Crowdsourced and translated from English (Alpaca / ShareGPT) prompts; manually cleaned. (Hugging Face)
Behbudiy / alpaca-cleaned-uz	≈ 25.1 MB / ≈ 51 760 rows	High-quality Uzbek translation of Stanford Alpaca. Enables general instruction-following. Still general-purpose, not legal or analytical.	Machine-translated and human-post-edited by Behbudiy Lab from the English Alpaca set. (Hugging Face)
Behbudiy / translation-instruction	≈ 13.8 MB / ≈ 20 000 examples	Instruction-style parallel data (English ↔ Uzbek). Useful for bilingual fine-tuning; limited to translation tasks.	Derived from existing parallel corpora reframed as “Translate this sentence into X” prompts. (Hugging Face)
Behbudiy / uzbek-sentiment-analysis	≈ 10 MB / ≈ 10 000 labeled samples	Supervised sentiment dataset (positive / negative / neutral). Good for evaluation but not for reasoning.	Manually labeled reviews and social-media texts from Uzbek online sources. (Hugging Face)
UzABSA (Sanatbek / aspect-based-sentiment-analysis-uzbek)	≈ 3 500 reviews / ≈ 6 100 sentences ( ~ 3–5 MB )	Aspect-based sentiment dataset for fine-grained opinion mining (restaurant domain). Useful for NLP benchmarking but small scale.	Collected from Uzbek restaurant reviews; manually annotated for aspect terms and polarities. (ACL Anthology 2024)
SimRelUz (Semantic Relatedness)	≈ 1 MB / 1 418 word pairs	Semantic similarity and relatedness ratings for Uzbek word pairs. Too small for training but valuable for embedding evaluation.	Word pairs rated (1–5 scale) by 11 native Uzbek speakers for similarity and relatedness. (arXiv 2205.06072)
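
The volume figures in the table can be reproduced in spirit with a few lines of Python. The sketch below is a minimal illustration, not the procedure used for this article: the `summarize` function and the two sample sentences are hypothetical, and a real run would iterate over the text field of a downloaded dataset instead of an in-memory list. It counts rows, UTF-8 bytes, and average words per row, and it estimates how much of the text is written in Cyrillic, since several corpora mix both Uzbek scripts.

```python
def summarize(texts):
    """Compute simple EDA statistics for an iterable of text records."""
    rows = 0
    total_bytes = 0
    total_words = 0
    latin = 0
    cyrillic = 0
    for text in texts:
        rows += 1
        total_bytes += len(text.encode("utf-8"))
        total_words += len(text.split())
        for ch in text:
            if "\u0400" <= ch <= "\u04ff":      # Unicode Cyrillic block
                cyrillic += 1
            elif ch.isascii() and ch.isalpha():  # basic Latin letters only
                latin += 1
    letters = latin + cyrillic
    return {
        "rows": rows,
        "utf8_bytes": total_bytes,
        "avg_words_per_row": total_words / rows if rows else 0.0,
        "cyrillic_share": cyrillic / letters if letters else 0.0,
    }


# Hypothetical two-record sample: the same sentence in Latin and Cyrillic.
sample = ["Men Toshkentda yashayman.", "Мен Тошкентда яшайман."]
stats = summarize(sample)
print(stats)
```

Counting bytes rather than characters matters here because Cyrillic letters take two bytes each in UTF-8, so a Cyrillic corpus looks larger on disk than a Latin one with the same number of sentences.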

5. How Each Class Contributes to Model Training

Model Stage	Data Type	Objective
Pretraining	Raw text corpora	Establish Uzbek language understanding—grammar, morphology, sentence formation.
Supervised Fine-Tuning	Instruction datasets	Teach the model to interpret user intent and deliver structured responses.
Reasoning Fine-Tuning	CoT datasets	Enable step-by-step legal, mathematical, or procedural reasoning.
Evaluation & Domain Adaptation	Annotated datasets	Benchmark and refine task-specific accuracy (e.g., sentiment, entities, classification).

Without instruction and CoT data, a model trained solely on text will appear fluent yet behave superficially—unable to follow directives, justify decisions, or apply rules.

6. Key Findings from the EDA

The Uzbek data ecosystem is text-rich but behavior-poor. Most datasets contain free text rather than supervised interactions or reasoning traces.
Instructional data is emerging but not domain-aligned. Current examples are general rather than specialized (e.g., law, education, healthcare).
Reasoning (CoT) data remains the largest gap. No open Uzbek dataset provides reasoning or explanation examples, limiting higher-order understanding.
Annotated datasets serve auxiliary purposes. They are useful for evaluation but cannot teach complex cognitive behavior.

7. Implications for Future Uzbek AI Development

The current landscape demonstrates that Uzbekistan has made progress in linguistic data availability, but still lacks reasoning-oriented, instruction-rich corpora.

To advance national AI research and productization:

Develop domain-specific instruction datasets (e.g., legal Q&A, civic information, educational tutoring).
Create Chain-of-Thought datasets in Uzbek that show intermediate reasoning steps and explanations.
Establish standardized benchmarks to measure understanding, reasoning, and factual correctness in Uzbek.
Encourage open collaboration between universities, AI labs, and government agencies to share curated, structured data.

8. Translation Data for Uzbek ↔ English

8.1 What Translation Data Is

Translation datasets are called parallel corpora — collections of sentence pairs (or paragraphs) in two or more languages that express the same meaning.

Each record has the structure:

{
  "source": "Men Toshkentda yashayman.",
  "target": "I live in Tashkent."
}

They are used to train and evaluate neural machine translation (NMT) models such as MarianMT, M2M100, or NLLB (No Language Left Behind).

In contrast to raw text corpora, which teach language patterns, parallel data explicitly teach cross-lingual alignment: how a sentence in Uzbek corresponds semantically and syntactically to one in English (or another language).
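
To make the record structure above concrete, the following sketch reads such records and keeps only complete pairs. It assumes the pairs are stored one JSON object per line (JSON Lines), which is a common convention but not something every dataset listed here follows; the `ParallelPair` class and `parse_pairs` function are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPair:
    source: str  # Uzbek sentence
    target: str  # English sentence


def parse_pairs(jsonl_text):
    """Parse JSON Lines text into pairs, skipping records with an empty side."""
    pairs = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        source = str(record.get("source", "")).strip()
        target = str(record.get("target", "")).strip()
        if source and target:
            pairs.append(ParallelPair(source, target))
    return pairs


# Hypothetical input: one complete record and one with a missing source.
raw = "\n".join([
    '{"source": "Men Toshkentda yashayman.", "target": "I live in Tashkent."}',
    '{"source": "", "target": "An orphaned target sentence."}',
])
pairs = parse_pairs(raw)
print(pairs)
```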

8.2 Types of Translation Data

There are three main categories of translation datasets relevant to Uzbek:

Type	Description	Example Datasets	Use Cases
Parallel Sentence Corpora	Sentence-level aligned pairs between Uzbek and other languages (English, Russian, Kazakh).	OPUS Tatoeba, Tilde MODEL Corpus, Parallel Uzbek–Kazakh corpus (PMC)	Training bilingual NMT models (e.g., English ↔ Uzbek).
Instructional Translation Datasets	Parallel text expressed as instruction–response format (translation as a task).	Behbudiy/translation-instruction	Fine-tuning general LLMs to perform translation via instruction tuning (“Translate this sentence into English”).
Multilingual General Corpora	Large web crawls or mixed datasets that include Uzbek among many other languages (often automatically aligned).	NLLB (Meta), OPUS GlobalVoices, CCAligned, WikiMatrix	Building multilingual or massively multilingual translation systems.

8.3 Overview of Available Uzbek Translation Resources

Dataset	Source	Languages	Description / Quality
Behbudiy / translation-instruction	Hugging Face	Uzbek ↔ English	20k instruction-form translation pairs. Manually curated and cleaned. Useful for LLM instruction fine-tuning.
OPUS Tatoeba	University of Helsinki	Uzbek ↔ English (and others)	A multilingual collection of example sentences and translations. Small, clean, conversational.
WikiMatrix	Facebook AI	Uzbek ↔ English	Sentence-aligned Wikipedia texts. Medium quality; alignment sometimes noisy.
CCAligned / CCMatrix	Facebook AI	Uzbek ↔ English	Large-scale automatically aligned Common Crawl data. Very noisy, requires filtering.
NLLB Seed Data	Meta (No Language Left Behind project)	Uzbek ↔ 200+ languages	Used internally by Meta to train NLLB-200 models. Public evaluation set available; raw data not fully open.
Parallel Uzbek–Kazakh Corpus	Academic dataset (PMC 2024)	Uzbek ↔ Kazakh	Sentence-level alignment of administrative and news content; suitable for regional cross-lingual transfer.

8.4 How These Datasets Are Used

a) For Traditional Machine Translation (NMT)

Parallel Uzbek–English corpora are used to train or fine-tune models such as MarianMT, mBART, NLLB, or M2M100.
Objective: learn a direct translation mapping via supervised sequence-to-sequence modeling.

b) For Multilingual LLMs and Instruction Fine-Tuning

Translation pairs can also be reframed as instruction tasks to fine-tune multilingual LLMs.

Example format:

{
  "instruction": "Translate the following Uzbek sentence into English.",
  "input": "Bu qonun 2023-yil 15-iyulda qabul qilingan.",
  "output": "This law was adopted on July 15, 2023."
}

This method doesn’t just teach the model bilingual equivalence — it also teaches how to perform translation on request, integrating translation into the model’s conversational behavior.
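
The reframing step can be sketched in a few lines. The example below is one possible way to do it, not the procedure behind any particular dataset: the `TEMPLATES` table and both function names are hypothetical, and the English-to-Uzbek instruction wording is an assumption that mirrors the Uzbek-to-English prompt in the example format. Each parallel pair yields two instruction records, one per direction.

```python
# Instruction wording per translation direction (the en->uz text is assumed).
TEMPLATES = {
    ("uz", "en"): "Translate the following Uzbek sentence into English.",
    ("en", "uz"): "Translate the following English sentence into Uzbek.",
}


def to_instruction(uz, en, direction=("uz", "en")):
    """Wrap one parallel pair as an instruction/input/output record."""
    if direction == ("uz", "en"):
        source, target = uz, en
    else:
        source, target = en, uz
    return {
        "instruction": TEMPLATES[direction],
        "input": source,
        "output": target,
    }


def both_directions(uz, en):
    """Produce one record per translation direction from a single pair."""
    return [
        to_instruction(uz, en, ("uz", "en")),
        to_instruction(uz, en, ("en", "uz")),
    ]


records = both_directions(
    "Bu qonun 2023-yil 15-iyulda qabul qilingan.",
    "This law was adopted on July 15, 2023.",
)
for record in records:
    print(record)
```

Emitting both directions doubles the number of training examples from the same parallel data, at the cost of making the instruction text the only signal that tells the model which way to translate.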

c) For Alignment and Multilingual Embeddings

Some Uzbek–English corpora are used to train multilingual embedding models (like LaBSE, LASER3).
These models learn a shared vector space where semantically similar sentences in both languages have nearby representations.
This is essential for multilingual search, cross-lingual RAG systems, or translation memory tools.
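
The idea of a shared vector space can be illustrated without any model at all. In the sketch below, the three-dimensional vectors are hand-made stand-ins for what an encoder such as LaBSE would produce (real embeddings have hundreds of dimensions), and the `cosine` and `nearest` helpers are hypothetical names. Retrieval then reduces to finding the English sentence whose vector has the highest cosine similarity to the Uzbek query vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, candidates):
    """Return the text of the candidate most similar to the query vector."""
    best_text, _ = max(candidates, key=lambda item: cosine(query_vec, item[1]))
    return best_text


# Toy vectors, invented for illustration; not produced by any real encoder.
uz_query_vec = [0.9, 0.1, 0.2]  # stands in for "Men Toshkentda yashayman."
en_docs = [
    ("I live in Tashkent.", [0.85, 0.15, 0.25]),
    ("This law was adopted on July 15, 2023.", [0.1, 0.9, 0.3]),
]
best = nearest(uz_query_vec, en_docs)
print(best)
```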

8.5 Quality and Coverage Challenges

Despite progress, Uzbek translation datasets face several limitations:

Data Size — Most Uzbek↔English corpora are small (tens of thousands of pairs), compared to tens of millions for major languages.
Domain Imbalance — Many pairs come from Wikipedia or news sources; there’s limited legal, technical, or colloquial data.
Alignment Noise — Automatically aligned datasets (like CCAligned) contain significant mismatches due to inconsistent sentence segmentation.
Script Variation — Uzbek appears in both Latin and Cyrillic scripts, complicating alignment and requiring preprocessing.
Cultural and Legal Context Loss — Direct translations often omit context critical for Uzbek-specific meaning (e.g., legal or bureaucratic expressions).
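
Alignment noise of the kind described above is often reduced with cheap heuristics before any model-based filtering. The sketch below shows one such heuristic, a length-ratio filter, under stated assumptions: the function names are hypothetical, lengths are measured in whitespace-separated tokens, and the 2.5 threshold is an illustrative value rather than a tuned one. Because Uzbek is agglutinative and tends to use fewer words than English for the same content, the threshold should be fairly permissive.

```python
def length_ratio(source, target):
    """Ratio of the longer side to the shorter side, in whitespace tokens."""
    a = len(source.split())
    b = len(target.split())
    if min(a, b) == 0:
        return float("inf")  # an empty side can never be a valid pair
    return max(a, b) / min(a, b)


def filter_pairs(pairs, max_ratio=2.5):
    """Keep only pairs whose length ratio is at or below max_ratio."""
    return [(s, t) for s, t in pairs if length_ratio(s, t) <= max_ratio]


# Hypothetical candidates: two plausible pairs, one mis-segmented, one empty.
candidates = [
    ("Men Toshkentda yashayman.", "I live in Tashkent."),
    ("Bu qonun 2023-yil 15-iyulda qabul qilingan.",
     "This law was adopted on July 15, 2023."),
    ("Salom.",
     "This law was adopted on July 15, 2023, after several rounds of public discussion."),
    ("", "I live in Tashkent."),
]
kept = filter_pairs(candidates)
print(len(kept), "of", len(candidates), "pairs kept")
```

A filter like this only catches gross segmentation mismatches; pairs of similar length but different meaning pass through and need a semantic check.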

8.6 Summary of Translation Data Usefulness

Model Goal	Data Type Needed	Example Dataset	Why It Helps
Basic bilingual translation	Parallel corpora	OPUS Tatoeba, WikiMatrix	Teaches direct mapping between Uzbek and English.
LLM instruction fine-tuning	Instruction-form translation pairs	Behbudiy/translation-instruction	Enables conversational translation tasks (“Translate this text…”).
Cross-lingual retrieval or RAG	Multilingual embeddings or alignment corpora	LaBSE, NLLB evaluation data	Allows combining Uzbek and English documents in one semantic space.

8.7 Outlook

To advance translation quality and integrate it with high-level reasoning (e.g., bilingual legal assistants), future projects should:

Expand professionally curated bilingual corpora with domain diversity (law, healthcare, government).
Add reasoning-style translation datasets, showing how to translate nuanced or context-heavy Uzbek expressions step-by-step.
Develop script normalization tools to handle Cyrillic↔Latin consistency.
Create translation benchmarks specific to Central Asian languages for fair evaluation (BLEU, COMET, ChrF++).
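
As a starting point for the script normalization tools mentioned above, a character-level Cyrillic-to-Latin mapping can be sketched as follows. This is a deliberately simplified illustration, not a complete transliterator: the `to_latin` function is a hypothetical name, and the table ignores context-dependent rules (for example, word-initial "е" is normally written "ye", and "ц" is rendered differently depending on position).

```python
# Simplified Uzbek Cyrillic -> Latin table; context-dependent rules are omitted.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ц": "ts", "ч": "ch", "ш": "sh", "ь": "",
    "э": "e", "ю": "yu", "я": "ya", "қ": "q", "ҳ": "h",
    "ў": "o\u02bb",  # o followed by U+02BB (turned comma)
    "ғ": "g\u02bb",  # g followed by U+02BB (turned comma)
    "ъ": "\u02bc",   # U+02BC (modifier apostrophe)
}


def to_latin(text):
    """Transliterate Cyrillic letters; leave all other characters unchanged."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)
            continue
        latin = CYR_TO_LAT[lower]
        if ch != lower:  # original letter was uppercase
            latin = latin[:1].upper() + latin[1:]
        out.append(latin)
    return "".join(out)


print(to_latin("Мен Тошкентда яшайман."))
print(to_latin("Ўзбекистон"))
```

Normalizing both sides of a corpus to one script before deduplication and alignment avoids treating the same sentence in two scripts as two different records.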

With these improvements, Uzbek can move from being a “low-resource language” to a strategically multilingual language in global AI.

9. Conclusion

Uzbek-language AI development has entered a transformative stage.
In its early years, progress centered on text collection — web crawls, literary archives, and digitized books that gave language models their foundational linguistic understanding.
Today, the next frontier lies not in more text, but in structured and functional data: resources that teach models not only how Uzbek sounds, but how Uzbek thought and reasoning operate.

Through this analysis, it is clear that:

Raw text corpora form a solid linguistic base for pretraining.
Instruction datasets (like UAzimov/uzbek-instruct-llm and Behbudiy/alpaca-cleaned-uz) provide the first step toward task-following behavior.
Annotated datasets add evaluation capabilities for classification and named-entity recognition tasks.
Chain-of-Thought data—the foundation of reasoning and interpretability—remains a critical missing piece.
Translation datasets (e.g., Behbudiy/translation-instruction, OPUS Tatoeba, WikiMatrix) are becoming an essential bridge between Uzbek and global languages, enabling multilingual LLMs and cross-lingual retrieval systems.

The strategic direction for the Uzbek AI community is now clear:

Develop domain-specific instruction datasets (especially in law, healthcare, and education).
Create reasoning-rich Chain-of-Thought corpora that capture Uzbek analytical processes.
Expand translation and alignment data to ensure interoperability between Uzbek and other major languages.
Build national evaluation benchmarks that measure reasoning quality, factual accuracy, and multilingual performance.
Establish an Uzbek GLUE-style benchmark (UzGLUE) to evaluate LLMs across different tasks.

By shifting focus from raw data collection to structured, high-quality, and explainable datasets, Uzbekistan can evolve from a low-resource language environment into a regional AI innovator—one that promotes ethical, transparent, and locally grounded artificial intelligence aligned with the country’s cultural and linguistic identity.
