| Form of presentation | Articles in Russian journals and collections |
| Year of publication | 2025 |
| Language | English |
| Author |
Arabov Mullosharaf Kurbonovich |
| Bibliographic description in the original language |
Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB. |
| Abstract |
The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language. |
| Keywords |
TAJIK LANGUAGE, LARGE LANGUAGE MODELS, LOW-RESOURCE LANGUAGES, CORPUS INFRASTRUCTURE, MORPHOLOGICAL RICHNESS, TOKENISATION, CODE-SWITCHING, LANGUAGE SAFETY, DETOXIFICATION, DIGITAL INEQUALITY |
| The name of the journal |
MODERN SCIENCE
|
| Please use this ID to cite or link to this record |
https://repository.kpfu.ru/eng/?p_id=323443&p_lang=2 |
| Full metadata record |
| DC field |
Value |
Language |
| dc.contributor.author |
Arabov Mullosharaf Kurbonovich |
ru_RU |
| dc.date.accessioned |
2025-01-01T00:00:00Z |
ru_RU |
| dc.date.available |
2025-01-01T00:00:00Z |
ru_RU |
| dc.date.issued |
2025 |
ru_RU |
| dc.identifier.citation |
Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB. |
ru_RU |
| dc.identifier.uri |
https://repository.kpfu.ru/eng/?p_id=323443&p_lang=2 |
ru_RU |
| dc.description.abstract |
MODERN SCIENCE |
ru_RU |
| dc.description.abstract |
The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language. |
ru_RU |
| dc.language.iso |
ru |
ru_RU |
| dc.subject |
TAJIK LANGUAGE |
ru_RU |
| dc.subject |
LARGE LANGUAGE MODELS |
ru_RU |
| dc.subject |
LOW-RESOURCE LANGUAGES |
ru_RU |
| dc.subject |
CORPUS INFRASTRUCTURE |
ru_RU |
| dc.subject |
MORPHOLOGICAL RICHNESS |
ru_RU |
| dc.subject |
TOKENISATION |
ru_RU |
| dc.subject |
CODE-SWITCHING |
ru_RU |
| dc.subject |
LANGUAGE SAFETY |
ru_RU |
| dc.subject |
DETOXIFICATION |
ru_RU |
| dc.subject |
DIGITAL INEQUALITY |
ru_RU |
| dc.title |
Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment |
ru_RU |
| dc.type |
Articles in Russian journals and collections |
ru_RU |
|