Algorithm for detecting relevant text elements based on morphological and frequency analysis

Abstract

The main objective of this work is to automate the detection of keywords and key phrases using modern natural language processing methods, which will improve the structuring and classification of text data and adapt them for further integration with classification systems. To this end, an algorithm is proposed for the automatic detection of keywords and key phrases in Russian-language texts, intended for use with complex multi-level classification systems such as UDC and GRNTI. The algorithm can process a single text without reference to a collection of documents. Keywords and key phrases are detected by a joint frequency and morphological analysis that takes the structure of the document into account. Key phrases are detected using lexico-grammatical patterns consisting of adjectives and nouns, as well as stable combinations of nouns. The algorithm works effectively with large texts divided into segments (zones of relevance). To adjust the rank of a relevant text element calculated by frequency analysis, a special coefficient is introduced that depends on the areas of the text in which the keyword occurs. A comparative analysis showed that the developed algorithm demonstrates high efficiency in keyword detection in comparison with the TF-IDF and TextRank algorithms. Integrating the automatic text analysis algorithm with classification systems opens up additional opportunities for structuring knowledge and for processing large amounts of data more efficiently.
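
The idea summarized in the abstract (frequency counts filtered by part of speech, an adjective + noun phrase pattern, and a coefficient tied to where a term occurs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the zone names, the weights in `ZONE_WEIGHTS`, the additive scoring rule, and the tiny English `TOY_POS` lexicon (a stand-in for a real morphological analyzer of Russian) are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter

# Hypothetical zone coefficients: an occurrence in the title counts more
# than an occurrence in the body. The real coefficient is not given in the abstract.
ZONE_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}

# Toy part-of-speech lexicon standing in for a morphological analyzer.
TOY_POS = {
    "morphological": "ADJ",
    "automatic": "ADJ",
    "analysis": "NOUN",
    "text": "NOUN",
    "keyword": "NOUN",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_candidates(zones):
    """Score nouns and ADJ+NOUN phrases by zone-weighted frequency.

    `zones` maps a zone name to the text found in that zone.
    For simplicity, phrase patterns are matched across sentence boundaries.
    """
    scores = Counter()
    for zone, text in zones.items():
        weight = ZONE_WEIGHTS.get(zone, 1.0)
        tokens = tokenize(text)
        tags = [TOY_POS.get(token) for token in tokens]
        for i, (token, tag) in enumerate(zip(tokens, tags)):
            if tag == "NOUN":
                scores[token] += weight
            # Lexico-grammatical pattern: adjective immediately followed by a noun.
            if tag == "ADJ" and i + 1 < len(tokens) and tags[i + 1] == "NOUN":
                scores[token + " " + tokens[i + 1]] += weight
    return scores


example_zones = {
    "title": "Morphological analysis of text",
    "body": "Morphological analysis helps. Automatic text analysis is useful.",
}
example_scores = rank_candidates(example_zones)
```

Because the ranking uses only the text itself, the sketch needs no document collection, in contrast to TF-IDF, which relies on corpus-wide document frequencies.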

About the authors

V. M. Veselovsky

Moscow Technical University of Communications and Informatics

Author for correspondence.
Email: vladveselovskij4147@gmail.com

Student

Russian Federation, Moscow

R. F. Khalabiya

Moscow Technical University of Communications and Informatics

Email: rustam-capitan@mail.ru

Ph.D. in Engineering Sciences, Associate Professor

Russian Federation, Moscow

I. V. Stepanova

Moscow Technical University of Communications and Informatics

Email: ivs_rrr@mail.ru

Ph.D. in Geology and Mineralogy Sciences, Associate Professor

Russian Federation, Moscow

Copyright (c) 2025 Informacionnye Tehnologii



Mass media outlet registered by the Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor).
Registration number and date of the registration decision: series PI No. 77-15565, 2 June 2003.