Kazan (Volga region) Federal University, KFU
 
PROBABILITY ANALYSIS OF THE VOCABULARY SIZE DYNAMICS USING GOOGLE BOOKS NGRAM CORPUS
Form of presentation: Articles in international journals and collections
Year of publication: 2018
Language: English
  • Bochkarev Vladimir Vladimirovich, author
  • Maslennikova Yuliya Sergeevna, author
    Bibliographic description in the original language: Pekina A., Maslennikova Y., Bochkarev V. Probability analysis of the vocabulary size dynamics using Google Books Ngram Corpus // CEUR Workshop Proceedings. - 2018. - Vol. 2268, Is.. - P. 202-207.
    Annotation: The article introduces a method for determining the rate of appearance of new words in a language. The method is based on probabilistic estimates of the vocabulary size of a large text corpus. Backward-predicted frequencies of rare words are estimated using linear models optimized by the maximum likelihood criterion. This approach provides more accurate frequency estimates for earlier periods; the lower the frequency of a word during the analyzed period, the greater the benefit. A posteriori estimates of the probability of appearance of new words were used to refine the vocabulary size for different years and the rate of appearance of new words. According to the proposed probabilistic model, more than 30% of the investigated English and Russian words had appeared in the language before the moment when they were first identified in the Google Books Ngram Corpus.
    Keywords: Word usage frequencies, prediction, Google Books Ngram
    Journal name: CEUR Workshop Proceedings
    URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85058976417&partnerID=40&md5=9a11efbe0d7295409759459f5ff2e650
    Please use this ID to cite or refer to this record: https://repository.kpfu.ru/eng/?p_id=194049&p_lang=2
