Журнал «Современная Наука»


TEXT COMPARISON SYSTEM: PREPROCESSING STAGES AND DOCUMENT ORIGINALITY ASSESSMENT

Krez Karina Sergeevna (Graduate Student, Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus)

Shneiderov Yevgeny Nikolaevich (PhD in Engineering, Associate Professor, Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus)

Golushko Vadim Igorevich (Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus)

This article presents a methodology for automated assessment of document originality that integrates modern natural language processing methods with classical text comparison algorithms. The proposed approach is a three-stage processing pipeline: preliminary cleaning and normalization of documents, semantic vectorization with the Sentence-BERT model, and similarity assessment using a hybrid algorithm. At the vectorization stage, the text is converted into 384-dimensional embeddings that reflect its semantic content, which enables efficient semantic search for potential borrowing sources using approximate nearest neighbor (ANN) search algorithms. To quantify the degree of similarity precisely, the shingling method with hashing is used, which ensures a deterministic comparison of text fragments. The developed algorithm automates uniqueness checking of academic papers, reducing the effort of manual verification and increasing the reliability of the results. The hybrid approach combines the efficiency of semantic search with the accuracy of classical comparison methods and can be applied to large text databases.
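The deterministic comparison stage can be illustrated with a minimal sketch using only the Python standard library. It is not the authors' implementation: the tokenization rule, the shingle length `k = 3`, the truncated SHA-1 hash, and the Jaccard coefficient as the similarity measure are all assumptions chosen for illustration, since the abstract does not specify them. The semantic stage (Sentence-BERT embeddings and approximate nearest neighbor search) is omitted here.

```python
import hashlib
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text, drop punctuation, and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def hashed_shingles(tokens: list[str], k: int = 3) -> set[int]:
    """Return the set of 64-bit hashes of all overlapping k-word shingles."""
    hashes = set()
    for i in range(len(tokens) - k + 1):
        shingle = " ".join(tokens[i:i + k])
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        # Keep only the first 8 bytes: an assumed trade-off between
        # memory use and collision risk.
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def shingle_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the hashed shingle sets of two texts."""
    sa = hashed_shingles(normalize(a), k)
    sb = hashed_shingles(normalize(b), k)
    if not sa and not sb:
        return 0.0  # both texts are shorter than k words: nothing to compare
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    source = "The model converts text into semantic embeddings"
    candidate = "the model converts text into dense vectors!"
    print(f"similarity = {shingle_similarity(source, candidate):.2f}")
```

Because the hash of a shingle depends only on its normalized words, the same pair of documents always yields the same score, which is the deterministic property the abstract refers to. In a hybrid pipeline of this kind, such a comparison would typically be run only against the candidate sources returned by the semantic search.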

Keywords: document vectorization, document originality assessment, equalization, embedding, shingle.

 




Citation link:
Krez K. S., Shneiderov Y. N., Golushko V. I. TEXT COMPARISON SYSTEM: PREPROCESSING STAGES AND DOCUMENT ORIGINALITY ASSESSMENT // Современная наука: актуальные проблемы теории и практики. Серия: Естественные и Технические Науки. 2026. No. 03. P. 116-121. DOI: 10.37882/2223-2966.2026.03.19
© ООО "Научные технологии"