Abstract
A failure detector is an important building block for fault-tolerant distributed computing: mechanisms such as distributed consensus and group communication rely on the information provided by failure detectors in order to make progress and terminate. Consequently, erroneous information from the failure detector (or its absence) may delay decision-making or lead the upper-layer fault-tolerant mechanism to take incorrect decisions (e.g., the exclusion of a correct process from a group membership). On the other hand, how precisely a failure detector can identify failures is limited by the actual behaviour of the system, especially in settings where message transmission delays and system loads vary over time. In this paper we explore the use of artificial neural networks to implement failure detectors that adapt dynamically to the current communication load conditions. The training patterns fed to the neural network were obtained by using Simple Network Management Protocol (SNMP) agents to read Management Information Base (MIB) variables. The output of the neural network is an estimate of the arrival time of the next heartbeat message from a remote process. The approach was fully implemented and tested on a set of networked GNU/Linux workstations. To analyze its efficiency, we ran a series of experiments in which network loads were varied randomly, and we measured several QoS parameters, comparing our detector against known implementations. The performance data collected indicate that neural networks and MIB variables can indeed be combined to improve the QoS of failure detectors.
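The idea in the abstract, a heartbeat failure detector whose timeout follows a predicted arrival time for the next heartbeat, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the class `AdaptiveHeartbeatDetector`, the `safety_margin` parameter, and in particular `linear_predictor`, which is a hypothetical stand-in for the trained neural network and does not reproduce the authors' model or their choice of MIB variables.

```python
from typing import Callable, Optional, Sequence


class AdaptiveHeartbeatDetector:
    """Suspects a remote process when its next heartbeat is later than predicted.

    `predictor` maps load measurements (e.g. values read from MIB variables)
    to the expected interval, in seconds, until the next heartbeat arrives.
    """

    def __init__(self, predictor: Callable[[Sequence[float]], float],
                 safety_margin: float = 0.1) -> None:
        self.predictor = predictor
        self.safety_margin = safety_margin  # hypothetical slack added to the estimate
        self.deadline: Optional[float] = None

    def on_heartbeat(self, arrival_time: float,
                     load_features: Sequence[float]) -> None:
        # Re-estimate the deadline for the next heartbeat from current load.
        interval = self.predictor(load_features)
        self.deadline = arrival_time + interval + self.safety_margin

    def suspects(self, now: float) -> bool:
        # Before the first heartbeat there is no deadline, so nothing is suspected.
        return self.deadline is not None and now > self.deadline


def linear_predictor(load_features: Sequence[float]) -> float:
    """Hypothetical stand-in for the neural network: a 1 s base heartbeat
    period plus a delay proportional to the mean load measurement."""
    base_period = 1.0
    mean_load = sum(load_features) / max(len(load_features), 1)
    return base_period + 0.5 * mean_load


detector = AdaptiveHeartbeatDetector(linear_predictor, safety_margin=0.1)
detector.on_heartbeat(arrival_time=10.0, load_features=[0.2, 0.4])
print(detector.suspects(now=11.0))  # still within the predicted deadline
print(detector.suspects(now=11.3))  # past the predicted deadline
```

Keeping the predictor as a plain callable means a trained network, a time-series model, or a fixed timeout could be swapped in without changing the detector logic, which is the kind of comparison the experiments describe.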
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lima, F., Macêdo, R. (2005). Adapting Failure Detectors to Communication Network Load Fluctuations Using SNMP and Artificial Neural Nets. In: Maziero, C.A., Gabriel Silva, J., Andrade, A.M.S., de Assis Silva, F.M. (eds) Dependable Computing. LADC 2005. Lecture Notes in Computer Science, vol 3747. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11572329_16
DOI: https://doi.org/10.1007/11572329_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29572-3
Online ISBN: 978-3-540-32092-0
eBook Packages: Computer Science, Computer Science (R0)