Kazan (Volga region) Federal University, KFU
DEVELOPING THE TAJIK LANGUAGE IN THE ERA OF LARGE LANGUAGE MODELS: CORPUS INFRASTRUCTURE, LINGUISTIC CHALLENGES, AND SAFETY ALIGNMENT
Form of presentation: Articles in Russian journals and collections
Year of publication: 2025
Language: English
  • Arabov Mullosharaf Kurbonovich, author
  • Bibliographic description in the original language: Arabov, M. K. Developing the Tajik language in the era of large language models: corpus infrastructure, linguistic challenges, and safety alignment / M. K. Arabov // Modern Science. – 2025. – No. 12-2. – P. 85-93. – EDN LQLURB.
    Annotation: The rapid progress of large language models (LLMs) has reshaped natural language processing, yet this progress has reinforced existing inequalities between high-resource and low-resource languages. Tajik, despite its long-standing literary tradition and official status, remains largely absent from contemporary LLM ecosystems. At the present stage, the language lacks publicly accessible, standardised and computationally usable corpora and datasets suitable for training, adaptation or evaluation of modern language models. Although a National Corpus of the Tajik Language is often cited, its internal structure, annotation formats and access conditions do not allow its effective use in reproducible NLP research. This paper adopts a theoretical and infrastructural perspective and analyses the structural reasons for this situation. The study identifies three interrelated domains that constrain the development of Tajik LLM technologies: data availability and quality, linguistic representation, and research infrastructure. Particular attention is paid to the discrepancy between classical linguistic proximity and functional technological compatibility, especially with respect to cross-lingual transfer from Persian. The paper does not present new datasets or empirical experiments; instead, it formulates a conceptual framework and preparatory research agenda intended to guide future corpus construction, linguistic preprocessing and safety-aware model adaptation for the Tajik language.
    Keywords: TAJIK LANGUAGE, LARGE LANGUAGE MODELS, LOW-RESOURCE LANGUAGES, CORPUS INFRASTRUCTURE, MORPHOLOGICAL RICHNESS, TOKENISATION, CODE-SWITCHING, LANGUAGE SAFETY, DETOXIFICATION, DIGITAL INEQUALITY
    Journal name: MODERN SCIENCE
    Please use this ID to quote from or refer to the card: https://repository.kpfu.ru/eng/?p_id=323443&p_lang=2
    Resource files: elibrary_87881993_55071084.pdf (0.17 MB, pdf)