Extraction of physical and technical information from text documents

Abstract

The study is motivated by the need to automate the analysis of text documents that describe physical and technical effects. As science and technology develop, the volume of scientific articles, patent documents and grant reports is growing rapidly, which calls for effective methods of extracting and analyzing the key data they contain. The theoretical significance of the work lies in a new method for automatically extracting physical and technical data, in the form of keyphrases, from natural-language text documents; the method combines deep learning with semantic-ontological text analysis. The practical significance lies in software that automatically extracts elements of physical and technical effects from natural-language texts. A corpus of more than 4,300 sentences was compiled from patent texts containing structured physical and technical information: descriptions of physical effects and of the technical problems solved. The KeyT5, T5 and BERT neural network models were trained to extract physical and technical information. The T5 and KeyT5 models achieved high results in extracting keyphrases that represent elements of descriptions of physical and technical effects (precision above 0.94, recall above 0.95).
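As an illustration of the reported metrics, the following minimal sketch shows one common way precision and recall can be computed for extracted keyphrases. It assumes exact matching on normalized phrases; the abstract does not state which matching rule the authors used. The function names and the sample phrases are hypothetical and are not taken from the paper's corpus.

```python
def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(phrase.lower().split())


def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of normalized keyphrases.

    Assumption: a predicted phrase counts as correct only if it equals a
    reference phrase after normalization.
    """
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical reference and model output for a single sentence.
gold = ["thermal expansion", "increase in temperature", "metal rod"]
predicted = ["Thermal  expansion", "metal rod", "electric current"]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted phrases match the reference, and two of the three reference phrases are found, so both values equal 2/3. A softer matching rule (for example, token overlap) would give different numbers.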

About the authors

D. M. Korobkin

Volgograd State Technical University

Author for correspondence.
Email: dkorobkin80@mail.ru

Ph.D., Assistant Professor

Russian Federation, Volgograd


Supplementary files

Fig. 1. Algorithm for compiling the training dataset
Fig. 2. Algorithm for training/fine-tuning the KeyT5 and T5 models
Fig. 3. Algorithm for compiling the training dataset for BERT
Fig. 4. Algorithm for training the BERT model

Copyright (c) 2026 Informacionnye Tehnologii



The mass media outlet is registered by the Federal Service for Supervision of Communications, Information Technology and Mass Media (Roskomnadzor).
Registration number and date of the registration decision: series PI No. 77-15565, dated June 2, 2003.