Monitoring fault tolerance in distributed systems

Abstract

The goal of this study is to develop and verify a model for monitoring the reliability and availability of distributed systems, built on probabilistic component characteristics and accounting for dependent failures. Modern distributed systems need failure prediction methods that capture complex dependencies between nodes and remain accurate under high load. Traditional approaches based on empirical data analysis often fail to predict system states under changing load, which limits their applicability. The developed probabilistic model was verified by numerical simulation, and its accuracy was assessed using Kullback–Leibler divergence and mean squared error (MSE), which confirmed its accuracy and practical value. Experiments also demonstrated the model's versatility: it adapts to various types of distributed systems while providing real-time predictions of availability and resilience. The numerical experiments showed that the proposed model can serve as a reliable tool for managing fault tolerance and load balancing, which makes it applicable to a wide range of distributed systems.
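The two accuracy measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the three-state example distributions (all nodes up, degraded, unavailable) are hypothetical, chosen only to show how a predicted state distribution might be compared with a simulated one.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(P || Q) in nats.

    p and q are discrete probability distributions over the same states.
    Terms with p_i == 0 contribute nothing; q_i is floored at eps to
    avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_squared_error(observed, predicted):
    """Mean squared error between two equally long sequences."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((o - f) ** 2 for o, f in zip(observed, predicted)) / len(observed)


# Hypothetical state probabilities: [all nodes up, degraded, unavailable].
simulated = [0.70, 0.20, 0.10]   # assumed output of a numerical simulation
predicted = [0.65, 0.25, 0.10]   # assumed output of a probabilistic model

kl = kl_divergence(simulated, predicted)
mse = mean_squared_error(simulated, predicted)
print(f"KL divergence: {kl:.6f} nats, MSE: {mse:.6f}")
```

Both measures are zero when the predicted distribution matches the simulated one exactly, and grow as the two diverge. KL divergence is asymmetric, so the order of the arguments matters.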

About the authors

Danil I. Sukhoplyuev

MIREA – Russian Technological University

Author for correspondence.
Email: sukhoplyuev.d.i@edu.mirea.ru
SPIN-code: 3931-0217

postgraduate student

Russian Federation, Moscow

Alexey N. Nazarov

Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences

Email: a.nazarov06@bk.ru
ORCID iD: 0000-0002-0497-0296
SPIN-code: 6032-5302

Dr. Sci. (Eng.), Professor

Russian Federation, Moscow

Supplementary files

Fig. 1. Formal architecture of NameNode – DataNode in Apache Hadoop (Source: https://www.analyticsvidhya.com/blog/2022/05/workings-of-hadoop-distributed-file-system-hdfs/)

Fig. 2. Nginx load balancer (Source: https://coderpad.io/blog/development/how-to-configure-different-load-balancing-algorithms-on-nginx/)