Development of a natural language processing tool for solving the application problem of extracting statistical data from text

Abstract

Text analytics explores textual content and derives new variables from raw text; these variables can serve as input to forecasting models or other statistical methods, including those aimed at fundamental problems. The purpose of the research is to analyze machine learning algorithms and practical developments in this field, and to develop an integrated software tool for text processing whose algorithm is built on the BasicStats, ReadabilityStats, and SovChLit libraries and which extracts statistics from large volumes of raw text in Russian. A method for extracting statistical data from large volumes of raw text, based on machine learning and natural language processing in Python, has been implemented so that it can be embedded in other projects. The resulting software tool uses the functionality of the textary library adapted for the Russian language and works with both plain texts and Doc objects generated with the spaCy library. The study used real text data collected from «63.ru», the information and news portal for the Samara region, as part of the conceptual project «Data Farm» carried out by the artificial intelligence research laboratory. The developed software analyzes large volumes of text data and extracts useful information from them. It can be integrated into other software solutions as one of the linking modules in the code optimization chain for text data processing programs.
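To make the idea of "extracting statistics from raw text" concrete, the following is a minimal sketch in plain Python. It is not the authors' tool and does not use the libraries named in the abstract; the function name `basic_text_stats`, the regular expressions, and the chosen set of statistics are all assumptions made for illustration. Sentence splitting here is deliberately naive (terminal punctuation only), so abbreviations and decimals would be miscounted.

```python
import re
from collections import Counter

# Hypothetical helper for illustration only; not part of the described software.
# Words: runs of Unicode letters (Cyrillic included), with optional internal hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# Sentence boundaries: one or more terminal punctuation marks (naive assumption).
SENT_END_RE = re.compile(r"[.!?…]+")


def basic_text_stats(text: str) -> dict:
    """Return simple descriptive statistics for a raw text string."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    # Sentences are the non-empty chunks between terminal punctuation runs.
    sentences = [s for s in SENT_END_RE.split(text) if s.strip()]
    n_words = len(words)
    n_sents = len(sentences)
    return {
        "n_chars": len(text),
        "n_words": n_words,
        "n_unique_words": len(set(words)),
        "n_sentences": n_sents,
        "avg_word_len": sum(map(len, words)) / n_words if n_words else 0.0,
        "avg_sent_len": n_words / n_sents if n_sents else 0.0,
        "top_words": Counter(words).most_common(3),
    }


if __name__ == "__main__":
    sample = "Самара растёт. Самара строит мосты!"
    for name, value in basic_text_stats(sample).items():
        print(f"{name}: {value}")
```

Values such as these can then be used as numeric features for a forecasting or statistical model, which is the use case the abstract describes.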

About the authors

O. Zakharova

Povolzhskiy State University of Telecommunications and Informatics

Author for correspondence.
Email: o.zaharova@psuti.ru

Associate Professor of the Information Systems and Technologies Department, Deputy Head of the Research Laboratory of Artificial Intelligence, PhD in Technical Sciences

Russian Federation, Samara

S. Bednyak

Povolzhskiy State University of Telecommunications and Informatics

Email: s.bednyak@psuti.ru

Associate Professor of the Information Systems and Technologies Department, PhD in Pedagogical Sciences

Russian Federation, Samara

Copyright © Zakharova O.I., Bednyak S.G., 2025

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.