USING THEMATIC CLUSTERING IN MULTIMODAL DATA TO SEARCH FOR IMPLICIT CONNECTIONS AND TRENDS IN THESAURUS DEVELOPMENT
With advances in artificial intelligence and machine learning, hierarchical probabilistic models have become practical tools in natural language processing. Probabilistic topic models, also called "thematic models", make it easier to discover the underlying themes that shape the content of text corpora. They have also proved useful for analyzing content beyond text, including images, biological data, and survey responses. One important application of thematic modeling is the identification of research trends.
Goal. The purpose of this study is to develop and experimentally validate a hybrid method for determining the optimal number of thematic clusters for the automatic updating of specialized thesauri, based on the analysis of multimodal scientific texts. The method relies on normalized metrics such as perplexity and consistency, which make it possible to assess topic quality and to identify implicit connections between terms within each topic. The study addresses two problems: optimizing the number of topics at the junction of one or more subject areas, and tracking the evolution of topics while highlighting the trend of each term within each topic.
Methods. The study proposes a new approach that integrates the LDA and BERTopic algorithms with an adaptive optimization function that simultaneously accounts for perplexity (P) and semantic consistency (C). An original mathematical model is developed to identify implicit relationships between terms by combining probabilistic and contextual similarity.
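The idea of choosing the number of topics from normalized perplexity (P) and semantic consistency (C) can be illustrated with a minimal sketch. The abstract does not give the actual optimization function, so everything below is an assumption made for illustration: the min-max normalization, the linear weighting with a hypothetical parameter `alpha`, the function names, and the sample values, which are not results from the study. C is treated here as a topic-quality score where higher is better, and P as a score where lower is better.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_num_topics(candidates, perplexity, consistency, alpha=0.5):
    """Pick the candidate topic count with the best combined score.

    Assumed score (hypothetical, not the paper's formula):
        score = alpha * C_norm + (1 - alpha) * (1 - P_norm)
    so high consistency and low perplexity are both rewarded.
    """
    if not (len(candidates) == len(perplexity) == len(consistency)):
        raise ValueError("all input lists must have the same length")
    p_norm = min_max(perplexity)
    c_norm = min_max(consistency)
    scores = [alpha * c + (1 - alpha) * (1 - p)
              for p, c in zip(p_norm, c_norm)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Illustrative inputs only: one P and one C value per candidate topic count.
ks = [5, 10, 15, 20]
P = [1200.0, 950.0, 900.0, 880.0]
C = [0.38, 0.52, 0.49, 0.41]
best_k, scores = select_num_topics(ks, P, C)
print(best_k, [round(s, 3) for s in scores])
```

In practice the P and C values would come from models fitted at each candidate topic count; the weight `alpha` controls how much the selection favors consistency over perplexity.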
Scientific novelty of the research. This study presents a mathematical model for identifying implicit semantic links between terms. The model combines probabilistic and contextual similarity, which makes it possible to identify new links that are missing from existing thesauri. In addition, a hybrid approach is presented that combines latent Dirichlet allocation (LDA) with BERTopic (using the BERTopic Python package) to determine the optimal number of thematic clusters in multimodal texts.
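As a rough illustration of combining probabilistic and contextual similarity to surface links that a thesaurus lacks, consider the sketch below. The paper's actual model is not stated in the abstract, so the following are assumptions: probabilistic similarity is taken as the cosine similarity of per-term topic-probability vectors, contextual similarity as the cosine similarity of term embeddings, and the two are mixed with a hypothetical weight `beta` and cut off at a hypothetical `threshold`. The terms and vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def implicit_links(topic_probs, embeddings, known_links, beta=0.5, threshold=0.8):
    """Return (term_a, term_b, score) for pairs absent from known_links.

    Assumed score (hypothetical, not the paper's model):
        beta * cos(topic vectors) + (1 - beta) * cos(embedding vectors)
    """
    terms = sorted(topic_probs)
    found = []
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if frozenset((a, b)) in known_links:
                continue  # the thesaurus already records this link
            score = (beta * cosine(topic_probs[a], topic_probs[b])
                     + (1 - beta) * cosine(embeddings[a], embeddings[b]))
            if score >= threshold:
                found.append((a, b, score))
    return sorted(found, key=lambda item: -item[2])


# Toy data: two topics, two-dimensional embeddings, one known thesaurus link.
topic_probs = {
    "glucose": [0.7, 0.3], "insulin": [0.8, 0.2],
    "metformin": [0.75, 0.25], "stent": [0.1, 0.9],
}
embeddings = {
    "glucose": [0.9, 0.2], "insulin": [1.0, 0.1],
    "metformin": [0.95, 0.15], "stent": [0.0, 1.0],
}
known_links = {frozenset(("insulin", "metformin"))}

links = implicit_links(topic_probs, embeddings, known_links)
for a, b, score in links:
    print(f"{a} - {b}: {score:.3f}")
```

Pairs that score above the threshold but are not yet in the thesaurus are candidates for review rather than confirmed links; the threshold and weighting would need to be tuned on real data.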
Results. The study produced a thematic model with an optimal number of topics at the junction of one or more subject areas. With the PubMed international database of medical publications and the Dimensions AI abstract and analytics database as the base dataset, the model made it possible to trace the evolution of topics, including the trend of each term within each topic. It helps researchers in various fields understand the interrelationships between topics and terms in multimodal texts.
Yurchak V.A. Using Thematic Clustering in Multimodal Data to Search for Implicit Connections and Trends in Thesaurus Development // Research result. Information technologies. – Vol. 10, No. 4, 2025. – P. 88-104. DOI: 10.18413/2518-1092-2025-10-4-0-8















