Modification of the method for modeling the thematic environment of terms using the LDA approach
- Authors: Zolotarev O.V.1, Yurchak V.A.1
-
Affiliation:
- Russian New University
- Issue: Vol. 12, No. 2 (2025)
- Pages: 19–27
- Section: Artificial intelligence and machine learning
- URL: https://journals.eco-vector.com/2313-223X/article/view/688951
- DOI: https://doi.org/10.33693/2313-223X-2025-12-2-19-27
- EDN: https://elibrary.ru/QPYWFS
- ID: 688951
Abstract
Topic modeling is an essential tool for analyzing large volumes of textual data, enabling the identification of latent semantic patterns. However, conventional approaches such as Latent Dirichlet Allocation (LDA) encounter difficulties with polysemous terms and unigram tokens, resulting in reduced accuracy and interpretability of the outcomes. This study develops a technique for constructing a thematic structure based on a refined LDA that incorporates contextual features, vector representations of words, and external vocabularies. The objective is to resolve terminological ambiguity and improve the interpretability of thematic groups. The paper employs a mathematical model that integrates probabilistic topic modeling with vector representations, facilitating the differentiation of word meanings and the establishment of precise connections between them. Using a corpus of Dimensions AI and PubMed publications, the study demonstrates an improved distribution of terms within thematic clusters, relying on frequency analysis and vector similarity as essential components. The results underscore the effectiveness of an integrated approach to handling complex linguistic structures in automated text analysis.
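The combination the abstract describes — probabilistic topic modeling refined with vector representations — can be sketched in miniature. The following is an illustrative toy, not the authors' actual model: plain LDA is fit by collapsed Gibbs sampling on a tiny hand-made corpus, and each topic's words are then re-ranked by combining P(w|z) with cosine similarity to a phi-weighted topic centroid in an assumed 2-d embedding space. The corpus, the vectors, and the scoring rule are all assumptions for illustration.

```python
import random

random.seed(0)

# Toy corpus with two latent themes (illustrative only, not the paper's
# Dimensions AI / PubMed corpus).
docs = [
    "gene protein cell expression".split(),
    "protein cell gene sequence".split(),
    "topic model word distribution".split(),
    "word topic distribution corpus".split(),
]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
K, V = 2, len(vocab)
alpha, beta = 0.5, 0.1          # symmetric Dirichlet priors

# Collapsed Gibbs sampling for plain LDA.
ndk = [[0] * K for _ in docs]   # document-topic counts
nkw = [[0] * V for _ in range(K)]  # topic-word counts
nk = [0] * K                    # tokens assigned to each topic
z = []                          # topic assignment per token
for d, doc in enumerate(docs):
    zs = []
    for w in doc:
        t = random.randrange(K)
        zs.append(t)
        ndk[d][t] += 1; nkw[t][w2i[w]] += 1; nk[t] += 1
    z.append(zs)

for _ in range(200):            # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t, wi = z[d][i], w2i[w]
            ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
            weights = [(ndk[d][k] + alpha) * (nkw[k][wi] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            r = random.random() * sum(weights)
            t = K - 1
            for k in range(K):
                r -= weights[k]
                if r <= 0:
                    t = k
                    break
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1

# Topic-word distributions P(w | z).
phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)] for k in range(K)]

# Assumed 2-d word vectors (a real system would use trained embeddings).
vec = {
    "gene": (0.9, 0.1), "protein": (0.95, 0.05), "cell": (0.85, 0.15),
    "expression": (0.8, 0.2), "sequence": (0.9, 0.2),
    "topic": (0.1, 0.9), "model": (0.15, 0.85), "word": (0.2, 0.8),
    "distribution": (0.1, 0.95), "corpus": (0.05, 0.9),
}

def cos(a, b):
    num = a[0] * b[0] + a[1] * b[1]
    den = (a[0] ** 2 + a[1] ** 2) ** 0.5 * (b[0] ** 2 + b[1] ** 2) ** 0.5
    return num / den

# Re-rank each topic's words: combine P(w|z) with cosine similarity to the
# phi-weighted topic centroid in embedding space.
tops = []
for k in range(K):
    centroid = tuple(sum(phi[k][w2i[w]] * vec[w][j] for w in vocab) for j in (0, 1))
    score = {w: phi[k][w2i[w]] * (1.0 + cos(vec[w], centroid)) for w in vocab}
    tops.append(sorted(score, key=score.get, reverse=True)[:3])

print(tops)
```

The re-ranking step is where the vector signal refines the purely frequency-based LDA ranking: a word that is frequent in a topic but semantically distant from the topic's centroid is pushed down, which is one simple way to separate senses of ambiguous terms.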

About the authors
Oleg Zolotarev
Russian New University
Author for correspondence.
Email: ol-zolot@yandex.ru
ORCID iD: 0000-0001-6917-9668
SPIN 代码: 5231-7243
Scopus 作者 ID: 57203129675
Researcher ID: AAR-4461-2021
Cand. Sci. (Eng.), Associate Professor; Head, Department of Information Systems in Economics and Management
Russian Federation, Moscow

Vladimir Yurchak
Russian New University
Email: yurchak.vladimir.1998@mail.ru
ORCID iD: 0000-0002-1362-802X
Researcher ID: GZG-2909-2022
postgraduate student, lecturer, Department of Information Systems in Economics and Management
Russian Federation, Moscow

References
- Angelov D. Top2Vec: Distributed representations of topics. arXiv:2008.09470. 2020. URL: https://arxiv.org/abs/2008.09470 (accessed: 12.05.2025).
- Grootendorst M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794. 2022. URL: https://arxiv.org/abs/2203.05794 (accessed: 12.05.2025).
- Dieng A.B., Ruiz F.J.R., Blei D.M. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics. 2020. Vol. 8. Pp. 439–453.
- Bianchi F., Terragni S., Hovy D. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. Findings of EMNLP. 2024. Pp. 2346–2359.
- Biggio M., Crippa F., Fumagalli A. et al. Joint document-token embeddings for hierarchical topic modeling. In: Contextualized-Top2Vec. Proceedings of the 2024 Conference on Neural Information Processing Systems. 2024. Pp. 10234–10246.
- Bianchi F., Terragni S., Hovy D. Combined Topic Model (CTM): Integrating contextualized embeddings into LDA. In: Findings of ACL. 2021. Pp. 1175–1188.
- Maheshwari K., Roberts M.E., Stewart B.M. Evaluating contextualized topic coherence for neural topic models. Journal of Machine Learning Research. 2022. Vol. 23. Pp. 1–20.
- Angelov D., Inkpen D. Hierarchical topic modeling with contextual token representations. In: Contextualized-Top2Vec. Proceedings of the 2024 Conference on Neural Information Processing Systems. 2024. URL: https://github.com/ddangelov/Top2Vec (accessed: 12.05.2025).
- Lee J., Yoon W., Kim S. et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020. Vol. 36. No. 4. Pp. 1234–1240.
- Zaheer M., Guruganesh G., Dubey K.A. et al. Big Bird: Transformers for longer sequences. In: NeurIPS. 2020. Pp. 17283–17297.
- Reimers N., Gurevych I. Sentence-BERT: Sentence embeddings using siamese BERT-Networks. In: Proceedings of EMNLP. 2019. Pp. 3980–3990.
- Pethe M., Joshi S., Kulkarni P. et al. SciBERT: A pretrained language model for scientific text. In: Proceedings of EMNLP. 2019. Pp. 3615–3620.
- McInnes L., Healy J., Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426. 2018. URL: https://arxiv.org/abs/1802.03426 (accessed: 12.05.2025).
- Fraley C., Raftery A.E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002. Vol. 97. No. 458. Pp. 611–631.
- Bodenreider O. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research. 2004. Vol. 32. Suppl. 1. Pp. D267–D270.
- Lowe H.J., Barnett G.O. Understanding and using the Medical Subject Headings (MeSH) vocabulary to perform literature searches. JAMA. 1994. Vol. 271. No. 14. Pp. 1103–1108.
- Zolotarev O.V., Hakimova A.Kh., Agraval S. et al. Removing terms from biomedical publications-an approach based on n-grams. In: Civilization of Knowledge: Russian realities. 2023. Pp. 136–160. EDN: IRLOBR.
Supplementary files
