KFU. Card publication. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment

DEVELOPING THE TAJIK LANGUAGE IN THE ERA OF LARGE LANGUAGE MODELS: CORPUS INFRASTRUCTURE, LINGUISTIC CHALLENGES, AND SAFETY ALIGNMENT

Form of presentation

Articles in Russian journals and collections

Year of publication

2025

Authors affiliated with KFU

Arabov Mullosharaf Kurbonovich, author

Bibliographic description in the original language

Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB.

Abstract

The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language.

Keywords

TAJIK LANGUAGE, LARGE LANGUAGE MODELS, LOW-RESOURCE LANGUAGES, CORPUS INFRASTRUCTURE, MORPHOLOGICAL RICHNESS, TOKENISATION, CODE-SWITCHING, LANGUAGE SAFETY, DETOXIFICATION, DIGITAL INEQUALITY

Journal name

MODERN SCIENCE

Please use this identifier to cite or link to this card

https://repository.kpfu.ru/eng/?p_id=323443&p_lang=2

Resource files

File name	Size (MB)	Format
elibrary_87881993_55071084.pdf	0.17	pdf

Full metadata record

Field DC	Value	Language
dc.contributor.author	Arabov Mullosharaf Kurbonovich	ru_RU
dc.date.accessioned	2025-01-01T00:00:00Z	ru_RU
dc.date.available	2025-01-01T00:00:00Z	ru_RU
dc.date.issued	2025	ru_RU
dc.identifier.citation	Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB.	ru_RU
dc.identifier.uri	https://repository.kpfu.ru/eng/?p_id=323443&p_lang=2	ru_RU
dc.description.abstract	MODERN SCIENCE	ru_RU
dc.description.abstract	The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language.	ru_RU
dc.language.iso	ru	ru_RU
dc.subject	TAJIK LANGUAGE	ru_RU
dc.subject	LARGE LANGUAGE MODELS	ru_RU
dc.subject	LOW-RESOURCE LANGUAGES	ru_RU
dc.subject	CORPUS INFRASTRUCTURE	ru_RU
dc.subject	MORPHOLOGICAL RICHNESS	ru_RU
dc.subject	TOKENISATION	ru_RU
dc.subject	CODE-SWITCHING	ru_RU
dc.subject	LANGUAGE SAFETY	ru_RU
dc.subject	DETOXIFICATION	ru_RU
dc.subject	DIGITAL INEQUALITY	ru_RU
dc.title	Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment	ru_RU
dc.type	Articles in Russian journals and collections	ru_RU