Modification of the method for modeling the thematic environment of terms using the LDA approach
- Authors: Zolotarev O.V.1, Yurchak V.A.1
-
Affiliation:
- Russian New University
- Issue: Vol. 12, No. 2 (2025)
- Pages: 19–27
- Section: Artificial intelligence and machine learning
- URL: https://journals.eco-vector.com/2313-223X/article/view/688951
- DOI: https://doi.org/10.33693/2313-223X-2025-12-2-19-27
- EDN: https://elibrary.ru/QPYWFS
- ID: 688951
Abstract
Topic modeling is an essential tool for analyzing large volumes of textual data, enabling the identification of latent semantic patterns. However, conventional approaches such as Latent Dirichlet Allocation (LDA) encounter difficulties with polysemous terms and unigram tokens, resulting in reduced accuracy and interpretability of the outcomes. This study develops a technique for constructing a thematic structure based on a refined LDA that incorporates contextual features, vector representations of words, and external vocabularies. The objective is to resolve terminological ambiguity and improve the interpretability of thematic groups. The paper employs a mathematical model that integrates probabilistic topic modeling with vector representations, facilitating the differentiation of word meanings and the establishment of precise connections between them. Using a corpus of Dimensions AI and PubMed publications, the study demonstrates an improved distribution of terms within thematic clusters, relying on frequency analysis and vector similarity as essential components. The results underscore the effectiveness of an integrated approach to handling complex linguistic structures in automated text analysis.
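The combination the abstract describes — probabilistic topic modeling refined with vector representations — can be sketched in miniature. The following is an illustrative toy, not the authors' actual model: plain LDA is fit by collapsed Gibbs sampling on a tiny hand-made corpus, and each topic's words are then re-ranked by combining P(w|z) with cosine similarity to a phi-weighted topic centroid in an assumed 2-d embedding space. The corpus, the vectors, and the scoring rule are all assumptions for illustration.

```python
import random

random.seed(0)

# Toy corpus with two latent themes (illustrative only, not the paper's
# Dimensions AI / PubMed corpus).
docs = [
    "gene protein cell expression".split(),
    "protein cell gene sequence".split(),
    "topic model word distribution".split(),
    "word topic distribution corpus".split(),
]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
K, V = 2, len(vocab)
alpha, beta = 0.5, 0.1          # symmetric Dirichlet priors

# Collapsed Gibbs sampling for plain LDA.
ndk = [[0] * K for _ in docs]   # document-topic counts
nkw = [[0] * V for _ in range(K)]  # topic-word counts
nk = [0] * K                    # tokens assigned to each topic
z = []                          # topic assignment per token
for d, doc in enumerate(docs):
    zs = []
    for w in doc:
        t = random.randrange(K)
        zs.append(t)
        ndk[d][t] += 1; nkw[t][w2i[w]] += 1; nk[t] += 1
    z.append(zs)

for _ in range(200):            # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t, wi = z[d][i], w2i[w]
            ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
            weights = [(ndk[d][k] + alpha) * (nkw[k][wi] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            r = random.random() * sum(weights)
            t = K - 1
            for k in range(K):
                r -= weights[k]
                if r <= 0:
                    t = k
                    break
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1

# Topic-word distributions P(w | z).
phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)] for k in range(K)]

# Assumed 2-d word vectors (a real system would use trained embeddings).
vec = {
    "gene": (0.9, 0.1), "protein": (0.95, 0.05), "cell": (0.85, 0.15),
    "expression": (0.8, 0.2), "sequence": (0.9, 0.2),
    "topic": (0.1, 0.9), "model": (0.15, 0.85), "word": (0.2, 0.8),
    "distribution": (0.1, 0.95), "corpus": (0.05, 0.9),
}

def cos(a, b):
    num = a[0] * b[0] + a[1] * b[1]
    den = (a[0] ** 2 + a[1] ** 2) ** 0.5 * (b[0] ** 2 + b[1] ** 2) ** 0.5
    return num / den

# Re-rank each topic's words: combine P(w|z) with cosine similarity to the
# phi-weighted topic centroid in embedding space.
tops = []
for k in range(K):
    centroid = tuple(sum(phi[k][w2i[w]] * vec[w][j] for w in vocab) for j in (0, 1))
    score = {w: phi[k][w2i[w]] * (1.0 + cos(vec[w], centroid)) for w in vocab}
    tops.append(sorted(score, key=score.get, reverse=True)[:3])

print(tops)
```

The re-ranking step is where the vector signal refines the purely frequency-based LDA ranking: a word that is frequent in a topic but semantically distant from the topic's centroid is pushed down, which is one simple way to separate senses of ambiguous terms.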

About the authors
Oleg Zolotarev
Russian New University
Author for correspondence.
Email: ol-zolot@yandex.ru
ORCID iD: 0000-0001-6917-9668
SPIN 代码: 5231-7243
Scopus 作者 ID: 57203129675
Researcher ID: AAR-4461-2021
Cand. Sci. (Eng.), Associate Professor; Head, Department of Information Systems in Economics and Management
Russian Federation, Moscow

Vladimir Yurchak
Russian New University
Email: yurchak.vladimir.1998@mail.ru
ORCID iD: 0000-0002-1362-802X
Researcher ID: GZG-2909-2022
postgraduate student, lecturer, Department of Information Systems in Economics and Management
Russian Federation, Moscow

References
- Angelov D. Top2Vec: Distributed representations of topics. arXiv:2008.09470. 2020. URL: https://arxiv.org/abs/2008.09470 (accessed: 12.05.2025).
- Grootendorst M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794. 2022. URL: https://arxiv.org/abs/2203.05794 (accessed: 12.05.2025).
- Dieng A.B., Ruiz F.J.R., Blei D.M. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics. 2020. Vol. 8. Pp. 439–453.
- Bianchi F., Terragni S., Hovy D. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. Findings of EMNLP. 2024. Pp. 2346–2359.
- Biggio M., Crippa F., Fumagalli A. et al. Joint document-token embeddings for hierarchical topic modeling. In: Contextualized-Top2Vec. Proceedings of the 2024 Conference on Neural Information Processing Systems. 2024. Pp. 10234–10246.
- Bianchi F., Terragni S., Hovy D. Combined Topic Model (CTM): Integrating contextualized embeddings into LDA. In: Findings of ACL. 2021. Pp. 1175–1188.
- Maheshwari K., Roberts M.E., Stewart B.M. Evaluating contextualized topic coherence for neural topic models. Journal of Machine Learning Research. 2022. Vol. 23. Pp. 1–20.
- Angelov D., Inkpen D. Hierarchical topic modeling with contextual token representations. In: Contextualized-Top2Vec. Proceedings of the 2024 Conference on Neural Information Processing Systems. 2024. URL: https://github.com/ddangelov/Top2Vec (accessed: 12.05.2025).
- Lee J., Yoon W., Kim S. et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020. Vol. 36. No. 4. Pp. 1234–1240.
- Zaheer M., Guruganesh G., Dubey K.A. et al. Big Bird: Transformers for longer sequences. In: NeurIPS. 2020. Pp. 17283–17297.
- Reimers N., Gurevych I. Sentence-BERT: Sentence embeddings using siamese BERT-Networks. In: Proceedings of EMNLP. 2019. Pp. 3980–3990.
- Pethe M., Joshi S., Kulkarni P. et al. SciBERT: A pretrained language model for scientific text. In: Proceedings of EMNLP. 2019. Pp. 3615–3620.
- McInnes L., Healy J., Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426. 2018. URL: https://arxiv.org/abs/1802.03426 (accessed: 12.05.2025).
- Fraley C., Raftery A.E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002. Vol. 97. No. 458. Pp. 611–631.
- Bodenreider O. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research. 2004. Vol. 32. Suppl. 1. Pp. D267–D270.
- Lowe H.J., Barnett G.O. Understanding and using the Medical Subject Headings (MeSH) vocabulary to perform literature searches. JAMA. 1994. Vol. 271. No. 14. Pp. 1103–1108.
- Zolotarev O.V., Hakimova A.Kh., Agraval S. et al. Removing terms from biomedical publications-an approach based on n-grams. In: Civilization of Knowledge: Russian realities. 2023. Pp. 136–160. EDN: IRLOBR.
Supplementary files
