Development of a natural language processing tool for solving the application problem of extracting statistical data from text
- Authors: Zakharova O.I.¹, Bednyak S.G.¹
- Affiliations:
  - ¹ Povolzhskiy State University of Telecommunications and Informatics
- Issue: Vol 22, No 1 (2024)
- Pages: 93-102
- Section: New information technologies
- URL: https://journals.eco-vector.com/2073-3909/article/view/689828
- DOI: https://doi.org/10.18469/ikt.2024.22.1.13
- ID: 689828
Abstract
Text analytics is used to explore textual content and derive new variables from raw text; these variables can serve as input to forecasting models or other statistical methods, including those applied to fundamental problems. The purpose of the research is to analyze machine learning algorithms and practical developments in this field, and to develop an integrated software tool for text processing whose algorithm is built on the BasicStats, ReadabilityStats, and SovChLit libraries and which extracts statistics from large volumes of raw Russian-language text. A method for extracting statistical data from large volumes of raw text, based on machine learning and natural language processing in Python, has been implemented so that it can be embedded in other projects. A software tool was developed that uses the functionality of the textacy library adapted for the Russian language and works with both plain texts and Doc objects generated by the spaCy library. The study used real text data collected from «63.ru», an information and news portal for the Samara region, as part of the «Data Farm» conceptual project of the artificial intelligence research laboratory. The developed software for extracting statistical data from text makes it possible to analyze large volumes of text data and extract useful information from them. It can be integrated into other software solutions as one of the linking modules in the code optimization chain for text data processing programs.
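To make the kind of statistics mentioned in the abstract more concrete, the following is a minimal sketch of basic text statistics for Russian text using only the Python standard library. It is not the authors' tool and does not reproduce the textacy or spaCy APIs: the names `BasicTextStats`, `basic_stats`, and `count_syllables` are hypothetical, sentence splitting is a naive regular expression, and syllables are approximated by counting vowel letters.

```python
import re
from dataclasses import dataclass

# Illustrative helpers only: none of these names come from the article or from textacy.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")  # runs of letters, optional inner hyphens
SENT_SPLIT_RE = re.compile(r"[.!?…]+")             # naive sentence boundary (assumption)
RU_VOWELS = frozenset("аеёиоуыэюя")


@dataclass(frozen=True)
class BasicTextStats:
    n_sents: int
    n_words: int
    n_unique_words: int
    n_chars: int          # characters inside word tokens only
    n_syllables: int
    avg_words_per_sent: float
    avg_syllables_per_word: float


def count_syllables(word: str) -> int:
    """Approximate syllables in a Russian word as its number of vowel letters (minimum 1)."""
    return max(1, sum(ch in RU_VOWELS for ch in word.lower()))


def basic_stats(text: str) -> BasicTextStats:
    """Compute simple counts and averages from raw text."""
    words = WORD_RE.findall(text)
    # Keep only the segments that contain at least one word.
    sents = [s for s in SENT_SPLIT_RE.split(text) if WORD_RE.search(s)]
    n_words = len(words)
    n_sents = len(sents)
    n_syllables = sum(count_syllables(w) for w in words)
    return BasicTextStats(
        n_sents=n_sents,
        n_words=n_words,
        n_unique_words=len({w.lower() for w in words}),
        n_chars=sum(len(w) for w in words),
        n_syllables=n_syllables,
        avg_words_per_sent=n_words / n_sents if n_sents else 0.0,
        avg_syllables_per_word=n_syllables / n_words if n_words else 0.0,
    )


if __name__ == "__main__":
    print(basic_stats("Самара стоит на Волге. Город большой!"))
```

Counts such as these are the usual inputs to readability formulas. A production pipeline would more likely take tokens and sentence boundaries from a spaCy `Doc` than from regular expressions.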
About the authors
O. Zakharova
Povolzhskiy State University of Telecommunications and Informatics
Author for correspondence.
Email: o.zaharova@psuti.ru
Associate Professor of the Information Systems and Technologies Department, Deputy Head of the Research Laboratory of Artificial Intelligence, PhD in Technical Sciences
Russian Federation, Samara
S. Bednyak
Povolzhskiy State University of Telecommunications and Informatics
Email: s.bednyak@psuti.ru
Associate Professor of the Information Systems and Technologies Department, PhD in Pedagogical Sciences
Russian Federation, Samara
