<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article article-type="research-article" dtd-version="1.2" xml:lang="ru" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><front><journal-meta><journal-id journal-id-type="issn">2518-1092</journal-id><journal-title-group><journal-title>Research result. Information technologies</journal-title></journal-title-group><issn pub-type="epub">2518-1092</issn></journal-meta><article-meta><article-id pub-id-type="doi">10.18413/2518-1092-2025-10-4-0-8</article-id><article-id pub-id-type="publisher-id">4018</article-id><article-categories><subj-group subj-group-type="heading"><subject>ARTIFICIAL INTELLIGENCE AND DECISION MAKING</subject></subj-group></article-categories><title-group><article-title>USING THEMATIC CLUSTERING IN MULTIMODAL DATA TO SEARCH FOR IMPLICIT CONNECTIONS AND TRENDS IN THESAURUS DEVELOPMENT</article-title><trans-title-group xml:lang="en"><trans-title>USING THEMATIC CLUSTERING IN MULTIMODAL DATA TO SEARCH FOR IMPLICIT CONNECTIONS AND TRENDS IN THESAURUS DEVELOPMENT</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Yurchak</surname><given-names>Vladimir Alexandrovich</given-names></name><name xml:lang="en"><surname>Yurchak</surname><given-names>Vladimir Alexandrovich</given-names></name></name-alternatives><email>rabota_pres14@rambler.ru</email></contrib></contrib-group><pub-date pub-type="epub"><year>2025</year></pub-date><volume>10</volume><issue>4</issue><fpage>0</fpage><lpage>0</lpage><self-uri content-type="pdf" xlink:href="/media/information/2025/4/ИТ_НР_10_4_8.pdf" /><abstract xml:lang="ru"><p>With the development of artificial intelligence and machine learning, it has become possible to use hierarchical probabilistic models in the field of natural language processing. 
Probabilistic, or "thematic", models have made it easier to discover the underlying themes that form the content of text corpora. Thematic models have proved useful for analyzing content that goes beyond text alone, including images, biological data, and survey responses. An important application of thematic modeling has been the identification of research trends.

Goal. The purpose of this study is to develop and experimentally validate a hybrid method for determining the optimal number of thematic clusters for the automatic updating of specialized thesauri, based on the analysis of multimodal scientific texts. The method relies on normalized perplexity and consistency (coherence) scores, which make it possible to assess the quality of topics and to identify implicit connections between terms within each topic. The study examines the problem of optimizing the number of topics spanning one or more subject areas and of tracking the evolution of topics while highlighting the trend of each term within each topic.

Methods. The study proposes a new approach integrating the LDA and BERTopic algorithms with an adaptive optimization function that simultaneously considers the metrics of perplexity (P) and semantic consistency (C). An original mathematical model has been developed to identify implicit relationships between terms through a combination of probabilistic and contextual similarity.

Scientific novelty of the research. This study presents a mathematical model for identifying implicit semantic links between terms that combines probabilistic and contextual similarity, which makes it possible to identify new links that are missing from thesauri. In addition, a hybrid approach is presented that combines latent Dirichlet allocation (LDA) and BERTopic (based on the BERTopic Python package) to determine the optimal number of thematic clusters in multimodal texts.

Results. The study produced a thematic model with an optimal number of topics spanning one or more subject areas. Built on a base dataset drawn from PubMed, the international knowledge base of medical publications, and the Dimensions AI abstract and analytical database, the model made it possible to trace the evolution of topics, with a trend for each term within each topic, and helped researchers from various fields understand the interrelationships between topics and terms in multimodal texts.</p></abstract><trans-abstract xml:lang="en"><p>With the development of artificial intelligence and machine learning, it has become possible to use hierarchical probabilistic models in the field of natural language processing. Probabilistic, or "thematic", models have made it easier to discover the underlying themes that form the content of text corpora. Thematic models have proved useful for analyzing content that goes beyond text alone, including images, biological data, and survey responses. An important application of thematic modeling has been the identification of research trends.

Goal. The purpose of this study is to develop and experimentally validate a hybrid method for determining the optimal number of thematic clusters for the automatic updating of specialized thesauri, based on the analysis of multimodal scientific texts. The method relies on normalized perplexity and consistency (coherence) scores, which make it possible to assess the quality of topics and to identify implicit connections between terms within each topic. The study examines the problem of optimizing the number of topics spanning one or more subject areas and of tracking the evolution of topics while highlighting the trend of each term within each topic.

Methods. The study proposes a new approach integrating the LDA and BERTopic algorithms with an adaptive optimization function that simultaneously considers the metrics of perplexity (P) and semantic consistency (C). An original mathematical model has been developed to identify implicit relationships between terms through a combination of probabilistic and contextual similarity.

Scientific novelty of the research. This study presents a mathematical model for identifying implicit semantic links between terms that combines probabilistic and contextual similarity, which makes it possible to identify new links that are missing from thesauri. In addition, a hybrid approach is presented that combines latent Dirichlet allocation (LDA) and BERTopic (based on the BERTopic Python package) to determine the optimal number of thematic clusters in multimodal texts.

Results. The study produced a thematic model with an optimal number of topics spanning one or more subject areas. Built on a base dataset drawn from PubMed, the international knowledge base of medical publications, and the Dimensions AI abstract and analytical database, the model made it possible to trace the evolution of topics, with a trend for each term within each topic, and helped researchers from various fields understand the interrelationships between topics and terms in multimodal texts.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>LDA</kwd><kwd>BERTopic</kwd><kwd>search for implicit connections</kwd><kwd>trend</kwd><kwd>semantic graph</kwd><kwd>PubMed</kwd><kwd>Dimensions AI</kwd><kwd>thematic modeling</kwd><kwd>perplexity</kwd><kwd>consistency</kwd></kwd-group><kwd-group xml:lang="en"><kwd>LDA</kwd><kwd>BERTopic</kwd><kwd>search for implicit connections</kwd><kwd>trend</kwd><kwd>semantic graph</kwd><kwd>PubMed</kwd><kwd>Dimensions AI</kwd><kwd>thematic modeling</kwd><kwd>perplexity</kwd><kwd>consistency</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="B1"><mixed-citation>1. Yurchak V.A. Tools for solving problems of recognition and clustering of data from documents using machine learning methods / Zolotarev O.V., Yurchak V.A. // IVD. – 2023. – No. 2(98). – P. 156-164.</mixed-citation></ref><ref id="B2"><mixed-citation>2. Kornei A.O. Semantic-statistical algorithm for determining the categories of aspects in sentiment analysis problems / Kornei A.O., Kryuchkova E.N. // Bulletin of the Southern Federal University. Technical sciences. – 2020. – No. 6 (216). – P. 66-74.</mixed-citation></ref><ref id="B3"><mixed-citation>3. Klimenko S.V. Using the ontological approach to analyze natural language texts / Klimenko S.V., Zolotarev O.V., Sharnin M.M. 
// Bulletin of the Russian New University. Series: Complex Systems: Models, Analysis and Management. – 2017. – P. 67-71.</mixed-citation></ref><ref id="B4"><mixed-citation>4. Khakimova A.Kh. Approaches to Creating a Multilingual Lexical Resource for Semantometric Assessment of Interlingual Semantic Similarity of Texts / Khakimova A.Kh., Zolotarev O.V., Sharnin M.M. // Nizhny Novgorod State University of Architecture and Civil Engineering, Research Center for Physics and Engineering Informatics. – 2019. – P. 319-324.</mixed-citation></ref><ref id="B5"><mixed-citation>5. Zolotarev O.V., Khakimova A.Kh., Sharnin M.M. Development of Methods for Intelligent Analysis of Scientific Publications to Monitor Priority Directions for the Development of Preventive and Personalized Medicine / O.V. Zolotarev, A.Kh. Khakimova, M.M. Sharnin // Bulletin of the Russian New University. Series "Complex Systems". – 2019. – P. 110-117.</mixed-citation></ref><ref id="B6"><mixed-citation>6. Methodology for constructing an associative-hierarchical portrait of a subject area: hierarchy of categories / Klimenko S.V. et al. // Autonomous Non-Commercial Organization "Institute of Physical and Technical Informatics". – 2017. – P. 251-260.</mixed-citation></ref><ref id="B7"><mixed-citation>7. Model and technology for extracting new terms from medical texts / Zolotarev O.V. et al. // Informatics and its Applications. – 2022. – P. 80-86.</mixed-citation></ref><ref id="B8"><mixed-citation>8. The measure of similarity of texts as a tool for assessing intertextuality in the analysis of large collections of documents / Zolotarev O.V. et al. // Bulletin of the Russian New University. Series: Complex Systems: Models, Analysis and Management. – 2016. – P. 62-71.</mixed-citation></ref><ref id="B9"><mixed-citation>9. 
Program for the allocation of terms from the corpus of texts / Zolotarev O.V. et al. // Autonomous non-profit organization of Higher Education "Russian New University". – 2023. – P. 1-2.</mixed-citation></ref><ref id="B10"><mixed-citation>10. A program for building a structured corpus of texts based on electronic databases of publications / Zolotarev O.V. et al. // Autonomous non-profit organization of Higher Education "Russian New University". – 2023. – P. 1-2.</mixed-citation></ref><ref id="B11"><mixed-citation>11. Farea A., Tripathi Sh., Glazko G., Emmert-Streib F. Investigating the optimal number of topics by advanced text-mining techniques: Sustainable energy research // Engineering Applications of Artificial Intelligence. – 2024. – Vol. 136, part A. URL: https://doi.org/10.1016/j.engappai.2024.108877.</mixed-citation></ref><ref id="B12"><mixed-citation>12. Li Y., Wang W., Yan X., Gao M., Xiao M. Research on the Application of Semantic Network in Disease Diagnosis Prompts Based on Medical Corpus // International Journal of Innovative Research in Computer Science and Technology (IJIRCST). – 2024. – P. 1-9. URL: https://doi.org/10.55524/ijircst.2024.12.2.1</mixed-citation></ref><ref id="B13"><mixed-citation>13. Bruches E.P. Methods and algorithms for recognizing and linking entities for building systems for automatic extraction of information from scientific texts: dis. for the degree of candidate of technical sciences. – Novosibirsk: Federal State Budgetary Institution of Science Institute of Informatics Systems named after Ershov, 2021. – 112 p.</mixed-citation></ref><ref id="B14"><mixed-citation>14. Dudarin P.V. Research and development of models and methods for fuzzy clustering of short texts: dis. for the degree of candidate of technical sciences. 
– Ulyanovsk: "Ulyanovsk State Technical University", 2021. – 141 p.</mixed-citation></ref><ref id="B15"><mixed-citation>15. Tutubalina E.V. Models and methods of automatic processing of unstructured data in the biomedical field: dis. for the degree of Doctor of Computer Science. – Kazan: Kazan (Volga Region) Federal University, 2023. – 225 p.</mixed-citation></ref><ref id="B16"><mixed-citation>16. Korney A.O. Methods and algorithms of aspect analysis for determining tonality based on a hybrid semantic-statistical model of a natural language: dis. for the academic degree of Candidate of Technical Sciences. – Barnaul: Federal state budgetary educational institution of higher education "Altai State Technical University named after I.I. Polzunov", 2021. – 134 p.</mixed-citation></ref></ref-list></back></article>