DOI: 10.18413/2518-1092-2025-10-4-0-8

USING THEMATIC CLUSTERING IN MULTIMODAL DATA TO SEARCH FOR IMPLICIT CONNECTIONS AND TRENDS IN THESAURUS DEVELOPMENT

Advances in artificial intelligence and machine learning have made it practical to use hierarchical probabilistic models in natural language processing. Probabilistic topic models, or "thematic models", make it easier to discover the latent themes that shape the content of text corpora. They have also proved useful for content that goes beyond purely textual information, including images, biological data, and survey responses. One important application of thematic modeling is the identification of research trends.

Goal. The purpose of this study is to develop and experimentally validate a hybrid method for determining the optimal number of thematic clusters for the automatic updating of specialized thesauri, based on the analysis of multimodal scientific texts. The method relies on normalized metrics, namely perplexity and coherence, which make it possible to assess topic quality and to identify implicit connections between terms within each topic. The study addresses two problems: optimizing the number of topics within one subject area or at the intersection of several, and tracking the evolution of topics while highlighting the trend of each term within each topic.

Methods. The study proposes a new approach that integrates the LDA and BERTopic algorithms with an adaptive optimization function that jointly accounts for perplexity (P) and semantic coherence (C). An original mathematical model has been developed to identify implicit relationships between terms by combining probabilistic and contextual similarity.
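
The abstract does not give the form of the adaptive optimization function, so the following Python sketch is only one plausible reading of it, not the authors' formula. It assumes that perplexity (P) and the semantic consistency or coherence score (C) have already been computed for each candidate number of topics, min-max normalizes both, and picks the candidate that maximizes a weighted difference. The function names, the weight `alpha`, and the sample values are all hypothetical.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_num_topics(candidates, perplexity, coherence, alpha=0.5):
    """Pick the candidate topic count with the best combined score.

    Assumed (hypothetical) scoring rule: alpha * C_norm - (1 - alpha) * P_norm,
    so higher coherence is rewarded and higher perplexity is penalized.
    """
    if not (len(candidates) == len(perplexity) == len(coherence)):
        raise ValueError("all three sequences must have the same length")
    p_norm = min_max(perplexity)
    c_norm = min_max(coherence)
    scores = [alpha * c - (1 - alpha) * p for p, c in zip(p_norm, c_norm)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Illustrative, made-up measurements for four candidate topic counts.
candidates = [5, 10, 15, 20]
perplexity = [900.0, 700.0, 650.0, 640.0]
coherence = [0.40, 0.55, 0.50, 0.42]

best_k, scores = select_num_topics(candidates, perplexity, coherence)
print(best_k)
```

Normalizing both metrics first keeps either one from dominating merely because of its scale; `alpha` then sets how much weight coherence receives relative to perplexity.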

Scientific novelty of the research. This study presents a mathematical model for identifying implicit semantic links between terms. The model combines probabilistic and contextual similarity, which makes it possible to identify new links that are missing from existing thesauri. In addition, a hybrid approach is presented that combines latent Dirichlet allocation (LDA) with BERTopic (based on the BERTopic Python package) to determine the optimal number of thematic clusters in multimodal texts.
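
As an illustration of how probabilistic and contextual similarity might be combined, the sketch below scores a pair of terms with a weighted sum of two cosine similarities: one over the terms' topic-probability vectors (for example, from LDA) and one over their contextual embeddings (for example, from a BERT-style encoder). This is an assumption made for illustration; the abstract does not state the actual model. The weight `lam`, the threshold, the function names, and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def implicit_link_score(probs_a, probs_b, emb_a, emb_b, lam=0.5):
    """Assumed scoring rule: lam * probabilistic sim + (1 - lam) * contextual sim."""
    return lam * cosine(probs_a, probs_b) + (1 - lam) * cosine(emb_a, emb_b)


def candidate_links(topic_probs, embeddings, known_links, threshold=0.8, lam=0.5):
    """Return term pairs scoring at or above threshold that the thesaurus lacks.

    topic_probs and embeddings map each term to a vector; known_links is a set
    of frozensets holding the term pairs already recorded in the thesaurus.
    """
    found = []
    for a, b in combinations(sorted(topic_probs), 2):
        if frozenset((a, b)) in known_links:
            continue
        score = implicit_link_score(
            topic_probs[a], topic_probs[b], embeddings[a], embeddings[b], lam
        )
        if score >= threshold:
            found.append((a, b, score))
    return found


# Toy vectors, made up for the example.
topic_probs = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}

links = candidate_links(topic_probs, embeddings, known_links=set())
print(links)
```

Pairs already present in the thesaurus are skipped, so the output contains only candidate links that are new relative to it.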

Results. The research produced a thematic model with an optimal number of topics within one subject area or at the intersection of several. With the PubMed international database of medical publications and the Dimensions AI abstract and analytical database as the base dataset, the model made it possible to trace the evolution of topics, including the trend of each term within each topic, and to help researchers from various fields understand the interrelationships between topics and terms in multimodal texts.
