<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article article-type="research-article" dtd-version="1.2" xml:lang="ru" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><front><journal-meta><journal-id journal-id-type="issn">2518-1092</journal-id><journal-title-group><journal-title>Research result. Information technologies</journal-title></journal-title-group><issn pub-type="epub">2518-1092</issn></journal-meta><article-meta><article-id pub-id-type="doi">10.18413/2518-1092-2021-6-1-0-5</article-id><article-id pub-id-type="publisher-id">2374</article-id><article-categories><subj-group subj-group-type="heading"><subject>INFORMATION SYSTEM AND TECHNOLOGIES</subject></subj-group></article-categories><title-group><article-title>COMPARATIVE ANALYSIS OF TEXT DATA STORAGE FORMATS FOR FURTHER PROCESSING BY METHODS OF MACHINE LEARNING</article-title><trans-title-group xml:lang="en"><trans-title>COMPARATIVE ANALYSIS OF TEXT DATA STORAGE FORMATS FOR FURTHER PROCESSING BY METHODS OF MACHINE LEARNING</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Naumov</surname><given-names>Ruslan Kirillovich</given-names></name><name xml:lang="en"><surname>Naumov</surname><given-names>Ruslan Kirillovich</given-names></name></name-alternatives><email>ruslan.naumow.dake@gmail.com</email></contrib><contrib contrib-type="author"><name-alternatives><name xml:lang="ru"><surname>Zhelezkov</surname><given-names>Nikita Eduardovich</given-names></name><name xml:lang="en"><surname>Zhelezkov</surname><given-names>Nikita Eduardovich</given-names></name></name-alternatives><email>nikita.e.zhelezkov@gmail.com</email></contrib></contrib-group><pub-date pub-type="epub"><year>2021</year></pub-date><volume>6</volume><issue>1</issue><fpage>0</fpage><lpage>0</lpage><self-uri content-type="pdf" xlink:href="/media/information/2021/1/ИТ_5.pdf" /><abstract xml:lang="ru"><p>Today, one of the most promising areas in the field of 
information technology is machine learning. It is used in many areas, including text data analysis. Between the data collection and analysis stages lies a data storage stage, and the choice of storage format for this data requires careful consideration. This article reviews the most popular text data storage formats used in machine learning and defines the criteria for comparing them. The result is a comparative table of the analyzed formats, from which a conclusion is drawn about the most efficient way to store text data.</p></abstract><trans-abstract xml:lang="en"><p>Today, one of the most promising areas in the field of information technology is machine learning. It is used in many areas, including text data analysis. Between the data collection and analysis stages lies a data storage stage, and the choice of storage format for this data requires careful consideration. This article reviews the most popular text data storage formats used in machine learning and defines the criteria for comparing them. The result is a comparative table of the analyzed formats, from which a conclusion is drawn about the most efficient way to store text data.</p></trans-abstract><kwd-group xml:lang="ru"><kwd>machine learning</kwd><kwd>text data</kwd><kwd>text formats</kwd><kwd>data serialization</kwd></kwd-group><kwd-group xml:lang="en"><kwd>machine learning</kwd><kwd>text data</kwd><kwd>text formats</kwd><kwd>data serialization</kwd></kwd-group></article-meta></front><back><ref-list><title>References</title><ref id="B1"><mixed-citation>Information Extraction from Multistructured Data and its Transformation into a Target Schema / Briukhov D.O., Stupnikov S.A., Kalinichenko L.A., Vovchenko A.E. // Selected Papers of the XVII International Conference on Data Analytics and Management in Data Intensive Domains. 
2015. pp. 81-90.</mixed-citation></ref><ref id="B2"><mixed-citation>Bedarev N.V., Voinov A.A. Natural Language Texts and Methods of Structured Data Extraction // International scientific and technological conference of students and young scientists "Youth. The science. Technologies". 2018. No. 2. pp. 37-42.</mixed-citation></ref><ref id="B3"><mixed-citation>Borisov A.V. Modern solutions and approaches to processing arrays of unstructured textual information in the field of big data // Problems of modern science and education. 2017. pp. 49-52.</mixed-citation></ref><ref id="B4"><mixed-citation>Petrova I.Y., Goryanin S.V. Information and analytical system EcoHealth for storage and analysis of structured and unstructured big data // Engineering and construction bulletin of the Caspian Sea region: scientific and technical journal. 2017. No. 3 (21). pp. 66-71.</mixed-citation></ref><ref id="B5"><mixed-citation>Pogodin G.V., Figo D.M., Vasiliev E.N. Serialization of data structures for storage and transmission in information systems. Methods and means // "Youth in Science". Collection of reports of the 16th scientific and technical conference. 2017. No. 2. pp. 231-236.</mixed-citation></ref><ref id="B6"><mixed-citation>CSV. URL: https://ru.wikipedia.org/wiki/CSV (accessed: 11.12.2020).</mixed-citation></ref><ref id="B7"><mixed-citation>Samoylenko N. Python for Network Engineers. Release 3.0. URL: https://pyneng.readthedocs.io/_/downloads/ru/latest/pdf/ (accessed: 12.12.2020).</mixed-citation></ref><ref id="B8"><mixed-citation>Extensible Markup Language (XML). URL: https://www.w3.org/XML/ (accessed: 12.12.2020).</mixed-citation></ref><ref id="B9"><mixed-citation>Kanaev K.A., Faleeva E.V., Ponomarchuk Y.V. Comparative Analysis of Data Exchange Formats for Applications with Client-Server Architecture // Basic research. 2015. No. 2-25. pp. 
5569-5572.</mixed-citation></ref><ref id="B10"><mixed-citation>Pilgrim M. Dive into Python 3. 2010.</mixed-citation></ref><ref id="B11"><mixed-citation>Suchkova E.A., Nikolaeva Yu.V. Development of an optimal data storage structure for decision support systems // Cybernetics and programming. 2016. No. 4. pp. 58-64.</mixed-citation></ref><ref id="B12"><mixed-citation>Romanov A.C. Database model for storing texts and their characteristics // Reports of the Tomsk State University of Control Systems and Radioelectronics. 2008. No. 1. pp. 70-73.</mixed-citation></ref><ref id="B13"><mixed-citation>Shevelev O.G. Representation of a set of texts in a relational database for the purpose of linguistic analysis. 2004.</mixed-citation></ref><ref id="B14"><mixed-citation>Dovbenko A.V. Data storage in NoSQL systems on the example of MongoDB. 2015.</mixed-citation></ref><ref id="B15"><mixed-citation>Koroteev M.V., Koroteev K. Review of Some Contemporary Trends in Machine Learning Technology // E-Management. 2018. pp. 26-35.</mixed-citation></ref><ref id="B16"><mixed-citation>Ruycheva A.P. Development of machine learning // Modern technologies in education: materials of an international scientific and practical conference. 2017. Part 1. pp. 232-237.</mixed-citation></ref><ref id="B17"><mixed-citation>Mison: A Fast JSON Parser for Data Analytics / Li Y., Katsipoulakis N., Chandramouli B., Goldstein J., Kossmann D. 2017.</mixed-citation></ref><ref id="B18"><mixed-citation>Savin I.V. Analysis of data storage systems // Bulletin of the Tula State University. Technical Sciences. 2019. pp. 193-196.</mixed-citation></ref><ref id="B19"><mixed-citation>Basov O.O., Saitov I.A. Main channels of interpersonal communication and their projection onto infocommunication systems // Proceedings of SPIIRAS. (7), pp. 122-140.</mixed-citation></ref><ref id="B20"><mixed-citation>The popularity of Python open-source systems is growing. Moscow: Open Systems, 2019. pp. 
5-11.</mixed-citation></ref><ref id="B21"><mixed-citation>Langdale G., Lemire D. Parsing gigabytes of JSON per second // The VLDB Journal: The International Journal on Very Large Data Bases. 2019. 28(6). pp. 941.</mixed-citation></ref></ref-list></back></article>