<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root>
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">Infokommunikacionnye tehnologii</journal-id><journal-title-group><journal-title xml:lang="en">Infokommunikacionnye tehnologii</journal-title><trans-title-group xml:lang="ru"><trans-title>Инфокоммуникационные технологии</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2073-3909</issn><publisher><publisher-name xml:lang="en">Povolzhskiy State University of Telecommunications and Informatics</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">689828</article-id><article-id pub-id-type="doi">10.18469/ikt.2024.22.1.13</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>New information technologies</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Новые информационные технологии</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">Development of a natural language processing tool for solving the application problem of extracting statistical data from text</article-title><trans-title-group xml:lang="ru"><trans-title>Разработка инструмента обработки естественного языка для решения прикладной задачи извлечения статистических данных из текста</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="en"><surname>Zakharova</surname><given-names>O. I.</given-names></name><name xml:lang="ru"><surname>Захарова</surname><given-names>О. 
И.</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Associate Professor of the Information Systems and Technologies Department, Deputy Head of the Research Laboratory of Artificial Intelligence, PhD in Technical Sciences</p></bio><bio xml:lang="ru"><p>к.т.н., доцент, доцент кафедры информационных систем и технологий (ИСТ), заместитель заведующего научно-исследовательской лабораторией искусственного интеллекта (НИЛ ИИ)</p></bio><email>o.zaharova@psuti.ru</email><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="en"><surname>Bednyak</surname><given-names>S. G.</given-names></name><name xml:lang="ru"><surname>Бедняк</surname><given-names>С. Г.</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Associate Professor of the Information Systems and Technologies Department, PhD in Pedagogical Sciences</p></bio><bio xml:lang="ru"><p>к.п.н., доцент, доцент кафедры ИСТ</p></bio><email>s.bednyak@psuti.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">Povolzhskiy State University of Telecommunications and Informatics</institution></aff><aff><institution xml:lang="ru">Поволжский государственный университет телекоммуникаций и информатики</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2025-03-09" publication-format="electronic"><day>09</day><month>03</month><year>2025</year></pub-date><volume>22</volume><issue>1</issue><issue-title xml:lang="en"/><issue-title xml:lang="ru"/><fpage>93</fpage><lpage>102</lpage><history><date date-type="received" iso-8601-date="2025-08-23"><day>23</day><month>08</month><year>2025</year></date><date date-type="accepted" 
iso-8601-date="2025-08-23"><day>23</day><month>08</month><year>2025</year></date></history><permissions><copyright-statement xml:lang="en">Copyright © 2025, Zakharova O.I., Bednyak S.G.</copyright-statement><copyright-statement xml:lang="ru">Copyright © 2025, Захарова О.И., Бедняк С.Г.</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="en">Zakharova O.I., Bednyak S.G.</copyright-holder><copyright-holder xml:lang="ru">Захарова О.И., Бедняк С.Г.</copyright-holder><ali:free_to_read xmlns:ali="http://www.niso.org/schemas/ali/1.0/"/><license><ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://creativecommons.org/licenses/by-nc-nd/4.0</ali:license_ref></license></permissions><self-uri xlink:href="https://journals.eco-vector.com/2073-3909/article/view/689828">https://journals.eco-vector.com/2073-3909/article/view/689828</self-uri><abstract xml:lang="en"><p>Text analytics is used to explore textual content and derive new variables from raw text, which can serve as input data for forecasting models or other statistical methods, including those addressing fundamental problems. The purpose of the research is to analyze machine learning algorithms and practical developments in this field, and to develop an integrable software tool for text processing, built on an algorithm structure based on the BasicStats, ReadabilityStats, and SovChLit libraries, that extracts statistics from large volumes of raw Russian-language text. A method for extracting statistical data from large volumes of raw text, based on machine learning and natural language processing in Python, has been implemented, with the option of embedding it in other projects. A software tool was developed that uses the functionality of the textary library, adapted for the Russian language, and can work both with plain texts and with Doc objects generated by the spaCy library. 
The study used real text data collected from the Samara region news and information portal «63.ru» (as part of the «Data Farm» conceptual project of the artificial intelligence research laboratory). The developed software tool for extracting statistical data from text makes it possible to analyze large volumes of text data and extract useful information from them. It can be integrated into other software solutions as one of the linking modules in the code optimization chain for text data processing programs.</p></abstract><trans-abstract xml:lang="ru"><p>Текстовая аналитика используется для изучения текстового содержимого и получения новых переменных из необработанного текста, которые можно использовать в качестве входных данных для моделей прогнозирования или других статистических методов, в том числе при решении фундаментальных задач. Цель исследования: проанализировать алгоритмы машинного обучения, практические наработки в этой области и разработать интегрируемый программный инструмент обработки текста, используя структуру алгоритма, на основе библиотек BasicStats, ReadabilityStats, SovChLit, позволяющий извлекать статистику из текстов большого объема на русском языке. Реализован метод извлечения статистических данных из необработанных текстов больших объемов на основе машинного обучения и обработки естественного языка на языке Python, с возможностью встраивания в другие проекты. Разработан программный инструмент, использующий функционал адаптированной для русского языка библиотеки textary, который позволяет работать как с текстами, так и с Doc-объектами, подготовленными с помощью библиотеки spaCy. Для проведения исследования были задействованы реальные текстовые данные, собранные с информационно-новостного портала по Самарской области «63.ru» (в рамках реализации концептуального проекта «Ферма данных» научно-исследовательской лаборатории искусственного интеллекта). 
Разработанный программный инструмент извлечения статистических данных из текста позволяет анализировать большие объемы текстовых данных и извлекать из них полезную информацию. Его можно интегрировать в другие программные решения, как один из связующих модулей в цепи оптимизации кода для программ по обработке текстовых данных.</p></trans-abstract><funding-group><funding-statement xml:lang="en">Natural Language Processing; natural language processing algorithm; text processing; statistical extraction; machine learning; Python</funding-statement><funding-statement xml:lang="ru">Natural Language Processing; алгоритм обработки естественного языка; обработка текста; извлечение статистических данных; машинное обучение; Python</funding-statement></funding-group></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><citation-alternatives><mixed-citation xml:lang="en">Zakharova O.I. Development of a system for analysis and processing of text data. Problemy tekhniki i tekhnologij telekommunikacij (PTiTT-2023): materialy XXV Mezhdunarodnoj nauchno-tekhnicheskoj konferencii. Kazan’: KNITU-KAI, 2023, pp. 261–262. (In Russ.)</mixed-citation><mixed-citation xml:lang="ru">Захарова О.И. Разработка системы анализа и обработки текстовых данных // Проблемы техники и технологий телекоммуникаций (ПТиТТ-2023): материалы XXV Международной научно-технической конференции. Казань: КНИТУ-КАИ, 2023. С. 261–262.</mixed-citation></citation-alternatives></ref><ref id="B2"><label>2.</label><citation-alternatives><mixed-citation xml:lang="en">Kuleshov S.V., Zaitseva A.A., Levashkin S.P. Technologies and principles of unstructured distributed data processing in the context of modern media content providing. Informatizaciya i svyaz’, 2020, no. 5. pp. 22–28. DOI: 10.34219/2078-8320-2020-11-5-22-28 (In Russ.)</mixed-citation><mixed-citation xml:lang="ru">Кулешов С.В., Зайцева А.А., Левашкин С.П. 
Технологии и принципы сбора и обработки неструктурированных распределенных данных с учетом современных особенностей предоставления медиа-контента // Информатизация и связь. 2020. № 5. С. 22–28. DOI: 10.34219/2078-8320-2020-11-5-22-28</mixed-citation></citation-alternatives></ref><ref id="B3"><label>3.</label><citation-alternatives><mixed-citation xml:lang="en">Zakharova O.I. Semantic analysis and synthesis of text data. Vestnik Voronezhskogo gosudarstvennogo universiteta. Seriya: Sistemnyj analiz i informacionnye tekhnologii, 2023, no. 4, pp. 182–208. DOI: 10.17308/sait/1995-5499/2023/4/182-208 (In Russ.)</mixed-citation><mixed-citation xml:lang="ru">Захарова О.И. Семантический анализ и синтез текстовых данных. Вестник Воронежского государственного университета. Серия: Системный анализ и информационные технологии. 2023. № 4. С. 182–208. DOI: 10.17308/sait/1995-5499/2023/4/182-208</mixed-citation></citation-alternatives></ref><ref id="B4"><label>4.</label><citation-alternatives><mixed-citation xml:lang="en">Smetanin S., Komarov M. Deep transfer learning baselines for sentiment analysis in Russian. Information Processing &amp; Management, 2021, vol. 58, no. 3, pp. 102484. DOI:10.1016/j.ipm.2020.102484. URL: https://www.sci-hub.ru/10.1016/j.ipm.2020.102484 (accessed: 28.06.2024).</mixed-citation><mixed-citation xml:lang="ru">Smetanin S., Komarov M. Deep transfer learning baselines for sentiment analysis in Russian // Information Processing &amp; Management. 2021. Vol. 58, no. 3. P. 102484. DOI:10.1016/j.ipm.2020.102484. URL: https://www.sci-hub.ru/10.1016/j.ipm.2020.102484 (дата обращения: 28.06.2024).</mixed-citation></citation-alternatives></ref><ref id="B5"><label>5.</label><citation-alternatives><mixed-citation xml:lang="en">Shavrina T.O. Methods of computational linguistics in the evaluation of artificial intelligence systems. Voprosy yazykoznaniya, 2021, no. 6, pp. 117–138. 
DOI: 10.31857/0373-658X.2021.6.117-138 (In Russ.)</mixed-citation><mixed-citation xml:lang="ru">Шаврина Т.О. О методах компьютерной лингвистики в оценке систем искусственного интеллекта // Вопросы языкознания. 2021. № 6. С. 117–138. DOI: 10.31857/0373-658X.2021.6.117-138</mixed-citation></citation-alternatives></ref><ref id="B6"><label>6.</label><citation-alternatives><mixed-citation xml:lang="en">Leshchinskaya N.M., Kolesnik M.A. Implementation of artificial intelligence technologies in Russia. Sociologiya iskusstvennogo intellekta, 2023, vol. 4, no. 2, pp. 63–72. (In Russ.)</mixed-citation><mixed-citation xml:lang="ru">Лещинская Н.М., Колесник М.А. Внедрение технологий искусственного интеллекта в России // Социология искусственного интеллекта. 2023. Т. 4, № 2. С. 63–72.</mixed-citation></citation-alternatives></ref><ref id="B7"><label>7.</label><citation-alternatives><mixed-citation xml:lang="en">Hartmann J. et al. Comparing automated text classification methods. International Journal of Research in Marketing, 2019, vol. 36, no. 1, pp. 20–36. DOI:10.1016/j.ijresmar.2018.09.009. URL: https://www.sci-hub.ru/10.1016/j.ijresmar.2018.09.009 (accessed: 20.07.2024).</mixed-citation><mixed-citation xml:lang="ru">Comparing automated text classification methods / J. Hartmann [et al.] // International Journal of Research in Marketing. 2019. Vol. 36, no. 1. P. 20–36. DOI:10.1016/j.ijresmar.2018.09.009. URL: https://www.sci-hub.ru/10.1016/j.ijresmar.2018.09.009 (дата обращения: 20.07.2024).</mixed-citation></citation-alternatives></ref><ref id="B8"><label>8.</label><citation-alternatives><mixed-citation xml:lang="en">Liu Y. et al. A robustly optimized BERT pretraining approach. URL: https://arxiv.org/pdf/1907.11692.pdf (accessed: 28.07.2024).</mixed-citation><mixed-citation xml:lang="ru">A robustly optimized BERT pretraining approach / Y. Liu [et al.]. 
URL: https://arxiv.org/pdf/1907.11692.pdf (дата обращения: 28.07.2024).</mixed-citation></citation-alternatives></ref><ref id="B9"><label>9.</label><citation-alternatives><mixed-citation xml:lang="en">Zakharova O.I., Levashkin S.P., Ivanov K.N. Modern Python libraries for collecting data from the Internet. Problemy tekhniki i tekhnologij telekommunikacij (PTiTT-2020): materialy XXII Mezhdunarodnoj nauchno-tekhnicheskoj konferencii, Samara: PSUTI, 2020, pp. 316–317. (In Russ.)</mixed-citation><mixed-citation xml:lang="ru">Захарова О.И., Левашкин С.П., Иванов К.Н. Современные библиотеки Python для сбора данных из интернета // Проблемы техники и технологий телекоммуникаций (ПТиТТ-2020): материалы XXII Международной научно-технической конференции. Самара: ПГУТИ, 2020. С. 316–317.</mixed-citation></citation-alternatives></ref><ref id="B10"><label>10.</label><citation-alternatives><mixed-citation xml:lang="en">Ansari G.J. et al. A novel machine learning approach for scene text extraction. Future Generation Computer Systems, 2018, vol. 87, pp. 328–340. DOI: 10.1016/J.FUTURE.2018.04.074</mixed-citation><mixed-citation xml:lang="ru">A novel machine learning approach for scene text extraction / G.J. Ansari [et al.] // Future Generation Computer Systems. 2018. Vol. 87. P. 328–340. DOI: 10.1016/J.FUTURE.2018.04.074</mixed-citation></citation-alternatives></ref><ref id="B11"><label>11.</label><citation-alternatives><mixed-citation xml:lang="en">Kim H. et al. Towards perfect text classification with Wikipedia-based semantic Naïve Bayes learning. Neurocomputing, 2018, vol. 315, pp. 128–134. DOI: 10.1016/J.NEUCOM.2018.07.002</mixed-citation><mixed-citation xml:lang="ru">Towards perfect text classification with Wikipedia-based semantic Naïve Bayes learning / H. Kim [et al.] // Neurocomputing. 2018. Vol. 315. P. 128–134. 
DOI: 10.1016/J.NEUCOM.2018.07.002</mixed-citation></citation-alternatives></ref><ref id="B12"><label>12.</label><citation-alternatives><mixed-citation xml:lang="en">Kratzwald B. et al. Deep learning for affective computing: Text-based emotion recognition in decision support. Decision Support Systems, 2018, vol. 115, pp. 24–35. DOI: 10.1016/J.DSS.2018.09.002</mixed-citation><mixed-citation xml:lang="ru">Deep learning for affective computing: Text-based emotion recognition in decision support / B. Kratzwald [et al.] // Decision Support Systems. 2018. Vol. 115. P. 24–35. DOI: 10.1016/J.DSS.2018.09.002</mixed-citation></citation-alternatives></ref><ref id="B13"><label>13.</label><citation-alternatives><mixed-citation xml:lang="en">Taylor E.M. et al. Web opinion mining and sentimental analysis. Advanced Techniques in Web Intelligence-2, 2013, pp. 105–126. DOI: 10.1007/978-3-642-33326-2_5</mixed-citation><mixed-citation xml:lang="ru">Web opinion mining and sentimental analysis / E.M. Taylor [et al.] // Advanced Techniques in Web Intelligence-2. P. 105–126. DOI: 10.1007/978-3-642-33326-2_5</mixed-citation></citation-alternatives></ref><ref id="B14"><label>14.</label><citation-alternatives><mixed-citation xml:lang="en">Wang Z. et al. A hybrid model of sentimental entity recognition on mobile social media. EURASIP Journal on Wireless Communications and Networking. DOI: 10.1186/s13638-016-0745-7. URL: https://sci-hub.ru/10.1186/s13638-016-0745-7 (accessed: 25.08.2024).</mixed-citation><mixed-citation xml:lang="ru">A hybrid model of sentimental entity recognition on mobile social media / Z. Wang [et al.] // EURASIP Journal on Wireless Communications and Networking. DOI: 10.1186/s13638-016-0745-7. URL: https://sci-hub.ru/10.1186/s13638-016-0745-7 (дата обращения: 25.08.2024).</mixed-citation></citation-alternatives></ref><ref id="B15"><label>15.</label><citation-alternatives><mixed-citation xml:lang="en">Altınel B., Ganiz M.C. 
Semantic text classification: A survey of past and recent advances. Information Processing &amp; Management, 2018, vol. 54, no. 6, pp. 1129–1153. DOI: 10.1016/J.IPM.2018.08.001</mixed-citation><mixed-citation xml:lang="ru">Altınel B., Ganiz M.C. Semantic text classification: A survey of past and recent advances // Information Processing &amp; Management. 2018. Vol. 54, no. 6. P. 1129–1153. DOI: 10.1016/J.IPM.2018.08.001</mixed-citation></citation-alternatives></ref><ref id="B16"><label>16.</label><citation-alternatives><mixed-citation xml:lang="en">Chatterjee A. et al. Understanding emotions in text using deep learning and big data. Computers in Human Behavior, 2019, vol. 93, pp. 309–317. DOI: 10.1016/J.CHB.2018.12.029</mixed-citation><mixed-citation xml:lang="ru">Understanding emotions in text using deep learning and big data / A. Chatterjee [et al.] // Computers in Human Behavior. 2019. Vol. 93. P. 309–317. DOI: 10.1016/J.CHB.2018.12.029</mixed-citation></citation-alternatives></ref><ref id="B17"><label>17.</label><citation-alternatives><mixed-citation xml:lang="en">Mazharul S.M. et al. Sentiment analysis of tweet data. Hoque Chowdhury. URL: https://www.researchgate.net/publication/324965434_SENTIMENT_ANALYSIS_OF_TWEET_DATA (accessed: 27.04.2024).</mixed-citation><mixed-citation xml:lang="ru">Sentiment analysis of tweet data / S.M. Mazharul [et al.] // Hoque Chowdhury. URL: https://www.researchgate.net/publication/324965434_SENTIMENT_ANALYSIS_OF_TWEET_DATA (дата обращения: 27.04.2024).</mixed-citation></citation-alternatives></ref><ref id="B18"><label>18.</label><citation-alternatives><mixed-citation xml:lang="en">Allahyari M. et al. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. 
URL: https://www.researchgate.net/publication/318336890_A_Brief_Survey_of_Text_Mining_Classification_Clustering_and_Extraction_Techniques (accessed: 30.08.2024).</mixed-citation><mixed-citation xml:lang="ru">A brief survey of text mining: classification, clustering and extraction techniques / M. Allahyari [et al.]. URL: https://www.researchgate.net/publication/318336890_A_Brief_Survey_of_Text_Mining_Classification_Clustering_and_Extraction_Techniques (дата обращения: 30.08.2024).</mixed-citation></citation-alternatives></ref><ref id="B19"><label>19.</label><citation-alternatives><mixed-citation xml:lang="en">Hedderich M.A. et al. A survey on recent approaches for natural language processing in low-resource scenarios. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2545–2568.</mixed-citation><mixed-citation xml:lang="ru">A survey on recent approaches for natural language processing in low-resource scenarios / M.A. Hedderich [et al.] // Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. P. 2545–2568.</mixed-citation></citation-alternatives></ref><ref id="B20"><label>20.</label><citation-alternatives><mixed-citation xml:lang="en">Popovski G., Seljak B.K., Eftimov T. A survey of named-entity recognition methods for food information extraction. IEEE Access, 2020, vol. 8, pp. 31586–31594. DOI: 10.1109/ACCESS.2020.2973502</mixed-citation><mixed-citation xml:lang="ru">Popovski G., Seljak B.K., Eftimov T. A survey of named-entity recognition methods for food information extraction // IEEE Access. 2020. Vol. 8. P. 31586–31594. DOI: 10.1109/ACCESS.2020.2973502</mixed-citation></citation-alternatives></ref><ref id="B21"><label>21.</label><citation-alternatives><mixed-citation xml:lang="en">Devlin J. et al. 
BERT: Pre-training of deep bidirectional transformers for language understanding. URL: https://arxiv.org/pdf/1810.04805 (accessed: 29.08.2024).</mixed-citation><mixed-citation xml:lang="ru">BERT: Pre-training of deep bidirectional transformers for language understanding / J. Devlin [et al.]. URL: https://arxiv.org/pdf/1810.04805 (дата обращения: 29.08.2024).</mixed-citation></citation-alternatives></ref><ref id="B22"><label>22.</label><citation-alternatives><mixed-citation xml:lang="en">Canete J. et al. Spanish pre-trained BERT model and evaluation data. Accepted as a workshop paper at PML4DC (ICLR). URL: https://www.researchgate.net/publication/372962444_Spanish_Pre-trained_BERT_Model_and_Evaluation_Data (accessed: 20.08.2024).</mixed-citation><mixed-citation xml:lang="ru">Spanish pre-trained BERT model and evaluation data / J. Canete [et al.] // Accepted as a workshop paper at PML4DC (ICLR). URL: https://www.researchgate.net/publication/372962444_Spanish_Pre-trained_BERT_Model_and_Evaluation_Data (дата обращения: 20.08.2024).</mixed-citation></citation-alternatives></ref><ref id="B23"><label>23.</label><citation-alternatives><mixed-citation xml:lang="en">Ivanov K.N., Zakharova O.I. Natural language processing. Application of language models. Aktual’nye problemy informatiki, radiotekhniki i svyazi: materialy XXX Rossijskoj nauchno-tekhnicheskoj konferencii. Samara: PSUTI, 2023, pp. 155–156. (In Russ.)</mixed-citation><mixed-citation xml:lang="ru">Иванов К.Н., Захарова О.И. Обработка естественного языка. Применение языковых моделей // Актуальные проблемы информатики, радиотехники и связи: материалы XXX Российской научно-технической конференции. Самара: ПГУТИ, 2023. С. 155–156.</mixed-citation></citation-alternatives></ref></ref-list></back></article>
