Ulanov Kirill Anatolyevich (Postgraduate Student, Department of Information Systems, Moscow State Technological University "Stankin")
The article addresses automatic quality control of "dark" data streams: Kafka topics that have neither reference labels nor a predefined schema. The aim of the work is to develop a self-supervised method for generating quality metrics for streaming data that can assess the reliability of unexplored events in real time without hand-written rules. A streaming algorithm is proposed in which a lightweight online encoder extracts features, a Boolean augmentation mask creates positive and negative examples, and a ranking loss function is trained on the principle of self-supervised ranking. On the open NYC Taxi dataset, the method outperformed rule-based checks, Isolation Forest, and Deep SVDD: P1000 rose to 0.74, and the error detection delay fell to 32 seconds at a load of 0.55 vCPU. The findings confirm that self-supervised ranking is an effective and resource-efficient framework for end-to-end data quality control in streaming systems.
Keywords: streaming data processing, data quality control, self-supervised learning, learning to rank, Apache Kafka, unexplored streams.
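The pipeline outlined in the abstract (online encoder, Boolean augmentation mask, ranking loss) can be illustrated with a minimal sketch. It is not the author's implementation, and every name and parameter in it is hypothetical. It assumes a running z-score as a stand-in for the lightweight encoder, a simulated unit error (masked fields multiplied by 100) as the corruption that produces negative examples, a linear scorer trained with a pairwise hinge loss, and synthetic Gaussian records instead of the NYC Taxi data.

```python
import numpy as np


class OnlineEncoder:
    """Running z-score features (Welford's algorithm); a stand-in encoder."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def encode(self, x):
        var = self.m2 / max(self.n - 1, 1)
        z = np.abs(x - self.mean) / np.sqrt(var + 1e-8)
        return np.minimum(z, 10.0)  # clip so early estimates cannot explode


def corrupt(x, mask, factor=100.0):
    """Negative example: scale the masked fields (assumed unit error)."""
    return np.where(mask, x * factor, x)


class RankScorer:
    """Linear quality score trained with a pairwise hinge ranking loss."""

    def __init__(self, dim, lr=0.01, margin=1.0):
        self.w = np.zeros(dim)
        self.lr = lr
        self.margin = margin

    def score(self, f):
        return float(self.w @ f)

    def step(self, f_pos, f_neg):
        diff = f_pos - f_neg
        loss = max(0.0, self.margin - float(self.w @ diff))
        if loss > 0.0:
            self.w += self.lr * diff  # gradient step on the active hinge
        return loss


def train_on_stream(stream, seed=0):
    """One pass over the stream: update encoder, mask, corrupt, rank."""
    rng = np.random.default_rng(seed)
    dim = stream.shape[1]
    enc, scorer = OnlineEncoder(dim), RankScorer(dim)
    for x in stream:
        enc.update(x)
        mask = rng.random(dim) < 0.5        # Boolean augmentation mask
        if not mask.any():
            mask[rng.integers(dim)] = True  # corrupt at least one field
        scorer.step(enc.encode(x), enc.encode(corrupt(x, mask)))
    return enc, scorer


# Synthetic three-field records (assumed data, for illustration only).
data_rng = np.random.default_rng(1)
stream = data_rng.normal([10.0, 2.0, 30.0], [2.0, 0.5, 5.0], size=(500, 3))
enc, scorer = train_on_stream(stream)

clean = np.array([10.0, 2.0, 30.0])
broken = np.array([10.0, 2.0, 3000.0])  # third field off by a factor of 100
print(scorer.score(enc.encode(clean)), scorer.score(enc.encode(broken)))
```

In this sketch a higher score means a more plausible record, so the intact record should score above the corrupted one. A real deployment would read records from a Kafka consumer and would likely replace the z-score encoder and the single corruption type with learned features and a wider set of augmentations.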
Citation link: Ulanov K. A. SELF-LEARNING FORMATION OF DATA QUALITY METRICS FOR UNEXPLORED STREAMS // Современная наука: актуальные проблемы теории и практики. Серия: Естественные и Технические Науки. 2025. No. 06. P. 241-245. DOI 10.37882/2223-2966.2025.06.46