Abstract
ZooKeeper is a coordination service, widely used as a backbone of various distributed systems. Though its reliability is of critical importance, testing is insufficient for an industrial-strength system of the size and complexity of ZooKeeper, and deep bugs can still be found. To this end, we resort to formal TLA\(^+\) specifications to further improve the reliability of ZooKeeper. Our primary objective is usability and automation, rather than full verification. We incrementally develop three levels of specifications for ZooKeeper. We first obtain the protocol specification, which unambiguously specifies the Zab protocol behind ZooKeeper. We then proceed to a finer grain and obtain the system specification, which serves as the super-doc for system development. In order to further leverage the model-level specification to improve the reliability of the code-level implementation, we develop the test specification, which guides the explorative testing of the ZooKeeper implementation. The formal specifications help eliminate the ambiguities in the protocol design and provide comprehensive system documentation. They also help find critical deep bugs in system implementation, which are beyond the reach of state-of-the-art testing techniques. Our specifications have been merged into the official Apache ZooKeeper project.
This work is supported by the National Natural Science Foundation of China (62372222), the CCF-Huawei Populus Grove Fund (CCF-HuaweiFM202304), the Cooperation Fund of Huawei-Nanjing University Next Generation Programming Innovation Lab (YBN2019105178SW38), the Fundamental Research Funds for the Central Universities (020214912222) and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Super-doc refers to the precise, concise and testable documentation of the system implementation, which can be explored and experimented on with tools [37].
References
Apache ZooKeeper’s issue tracking system. https://issues.apache.org/jira/projects/ZOOKEEPER/issues
Issue: ZK-2845. https://issues.apache.org/jira/browse/ZOOKEEPER-2845
Issue: ZK-3615. https://issues.apache.org/jira/browse/ZOOKEEPER-3615
Issue: ZK-3911. https://issues.apache.org/jira/browse/ZOOKEEPER-3911
Three levels of TLA\(^+\) specifications for ZooKeeper. https://github.com/Disalg-ICS-NJU/zookeeper-tla-spec
TLA\(^+\) home page. https://lamport.azurewebsites.net/tla/tla.html
TLA\(^+\) specification for Paxos. https://github.com/tlaplus/Examples/blob/master/specifications/PaxosHowToWinATuringAward/Paxos.tla
TLA\(^+\) specification for PaxosStore. https://github.com/Starydark/PaxosStore-tla
TLA\(^+\) specification for Raft. https://github.com/ongardie/raft.tla
TLA\(^+\) specifications for the Apache ZooKeeper project. https://github.com/apache/zookeeper/tree/master/zookeeper-specifications
The use of TLA\(^+\) in industry. https://lamport.azurewebsites.net/tla/industrial-use.html
Zab’s wiki. https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0
ZooKeeper home page. https://zookeeper.apache.org/
Bourque, P., Fairley, R.E., Society, I.C.: Guide to the Software Engineering Body of Knowledge (SWEBOK(R)): Version 3.0, 3rd edn. IEEE Computer Society Press, Washington, DC, USA (2014)
Davis, A.J.J., Hirschhorn, M., Schvimer, J.: eXtreme modelling in practice. Proc. VLDB Endow. 13(9), 1346–1358 (2020). https://doi.org/10.14778/3397230.3397233
Dorminey, S.: Kayfabe: model-based program testing with TLA\(^+\)/TLC. Technical report, Microsoft Azure WAN (2020). https://conf.tlapl.us/2020/11-Star_Dorminey-Kayfabe_Model_based_program_testing_with_TLC.pdf
Gafni, E., Lamport, L.: Disk Paxos. Distrib. Comput. 16(1), 1–20 (2003). https://doi.org/10.1007/s00446-002-0070-8
Gu, X., Cao, W., Zhu, Y., Song, X., Huang, Y., Ma, X.: Compositional model checking of consensus protocols specified in TLA\(^+\) via interaction-preserving abstraction. In: Proceedings of International Symposium on Reliable Distributed Systems (SRDS 2022). IEEE (2022). https://doi.org/10.1109/srds55811.2022.00018
Gu, X., Wei, H., Qiao, L., Huang, Y.: Raft with out-of-order executions. Int. J. Softw. Informatics 11(4), 473–503 (2021). https://doi.org/10.21655/ijsi.1673-7288.00257
Gunawi, H.S., et al.: FATE and DESTINI: a framework for cloud recovery testing. In: Proceedings of the 8th USENIX conference on Networked Systems Design and Implementation. p. 239 (2011). https://dl.acm.org/doi/10.5555/1972457.1972482
Huang, B., Ouyang, L.: Pull request for ZOOKEEPER-3615: provide formal specification and verification using TLA\(^+\) for Zab #1690. https://github.com/apache/zookeeper/pull/1690
Hunt, P., Konar, M., Junqueira, F.P., Reed, B.: ZooKeeper: wait-free coordination for internet-scale systems. In: Proceedings of ATC 2010, USENIX Annual Technical Conference, pp. 145–158. USENIX (2010). https://dl.acm.org/doi/10.5555/1855840.1855851
Junqueira, F.P., Reed, B.C., Serafini, M.: Dissecting Zab. Technical report, YL-2010-007, Yahoo! Research, Sunnyvale, CA, USA (2010). https://cwiki.apache.org/confluence/download/attachments/24193444/yl-2010-007.pdf
Junqueira, F.P., Reed, B.C., Serafini, M.: Zab: high-performance broadcast for primary-backup systems. In: Proceedings of DSN 2011, IEEE/IFIP Conference on Dependable Systems and Networks, pp. 245–256. IEEE (2011). https://doi.org/10.1109/DSN.2011.5958223
Kim, B.H., Kim, T., Lie, D.: Modulo: finding convergence failure bugs in distributed systems with divergence resync models. https://www.usenix.org/conference/atc22/presentation/kim-beom-heyn
Kuprianov, A., Konnov, I.: Model-based testing with TLA\(^+\) and Apalache. Technical report, Informal Systems (2020). https://conf.tlapl.us/2020/09-Kuprianov_and_Konnov-Model-based_testing_with_TLA_+_and_Apalache.pdf
Lamport, L.: Fast Paxos. Distrib. Comput. 19(2), 79–103 (2006). https://doi.org/10.1007/s00446-006-0005-x
Lamport, L.: The PlusCal Code for Byzantizing Paxos by Refinement. TechReport, Microsoft Research (2011)
Lamport, L., Merz, S., D, D.: A TLA\(^+\) specification of Paxos and its refinement (2019). https://github.com/tlaplus/Examples/tree/master/specifications/Paxos
Leesatapornwongsa, T., Gunawi, H.S.: SAMC: a fast model checker for finding Heisenbugs in distributed systems (demo). In: Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015, pp. 423–427. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2771783.2784771
Leesatapornwongsa, T., Hao, M., Joshi, P., Lukman, J.F., Gunawi, H.S.: SAMC: semantic-aware model checking for fast discovery of deep bugs in cloud systems. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI 2014, pp. 399–414. USENIX Association, Berkeley (2014). https://dl.acm.org/doi/10.5555/2685048.2685080
Lukman, J.F., et al.: Flymc: highly scalable testing of complex interleavings in distributed systems. In: Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1–16 (2019). https://doi.org/10.1145/3302424.3303986
Marić, O., Sprenger, C., Basin, D.: Cutoff bounds for consensus algorithms. In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol. 10427, pp. 217–237. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63390-9_12
Medeiros, A.: ZooKeeper’s atomic broadcast protocol: theory and practice (2012). https://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf
Moraru, I., Andersen, D.G., Kaminsky, M.: There is more consensus in egalitarian parliaments. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 358–372 (2013). https://doi.org/10.1145/2517349.2517350
Newcombe, C., Rath, T., Zhang, F., Munteanu, B., Brooker, M., Deardeuff, M.: How amazon web services uses formal methods. Commun. ACM 58(4), 66–73 (2015). https://doi.org/10.1145/2699417
Ongaro, D., Ousterhout, J.: In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC 2014, pp. 305–320. USENIX Association, Berkeley (2014). https://dl.acm.org/doi/10.5555/2643634.2643666
Ozkan, B.K., Majumdar, R., Niksic, F., Befrouei, M.T., Weissenbacher, G.: Randomized testing of distributed systems with probabilistic guarantees. Proc. ACM Program. Lang. 2(OOPSLA), 1–28 (2018). https://doi.org/10.1145/3276530
Ozkan, B.K., Majumdar, R., Oraee, S.: Trace aware random testing for distributed systems. Proc. ACM Program. Lang. 3(OOPSLA), 1–29 (2019). https://doi.org/10.1145/3360606
Wang, D., Dou, W., Gao, Y., Wu, C., Wei, J., Huang, T.: Model checking guided testing for distributed systems. In: Proceedings of the Eighteenth European Conference on Computer Systems, pp. 127–143 (2023). https://doi.org/10.1145/3552326.3587442
Yin, J.Q., Zhu, H.B., Fei, Y.: Specification and verification of the Zab protocol with TLA\(^+\). J. Comput. Sci. Technol. 35, 1312–1323 (2020). https://doi.org/10.1007/s11390-020-0538-7
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ouyang, L., Huang, Y., Huang, B., Ma, X. (2024). Leveraging TLA\(^+\) Specifications to Improve the Reliability of the ZooKeeperCoordination Service. In: Hermanns, H., Sun, J., Bu, L. (eds) Dependable Software Engineering. Theories, Tools, and Applications. SETTA 2023. Lecture Notes in Computer Science, vol 14464. Springer, Singapore. https://doi.org/10.1007/978-981-99-8664-4_11
Download citation
DOI: https://doi.org/10.1007/978-981-99-8664-4_11
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8663-7
Online ISBN: 978-981-99-8664-4
eBook Packages: Computer ScienceComputer Science (R0)