Leveraging TLA $$^+$$ Specifications to Improve the Reliability of the ZooKeeperCoordination Service | SpringerLink
Skip to main content

Leveraging TLA\(^+\) Specifications to Improve the Reliability of the ZooKeeperCoordination Service

  • Conference paper
  • First Online:
Dependable Software Engineering. Theories, Tools, and Applications (SETTA 2023)

Abstract

ZooKeeper is a coordination service, widely used as a backbone of various distributed systems. Though its reliability is of critical importance, testing is insufficient for an industrial-strength system of the size and complexity of ZooKeeper, and deep bugs can still be found. To this end, we resort to formal TLA\(^+\) specifications to further improve the reliability of ZooKeeper. Our primary objective is usability and automation, rather than full verification. We incrementally develop three levels of specifications for ZooKeeper. We first obtain the protocol specification, which unambiguously specifies the Zab protocol behind ZooKeeper. We then proceed to a finer grain and obtain the system specification, which serves as the super-doc for system development. In order to further leverage the model-level specification to improve the reliability of the code-level implementation, we develop the test specification, which guides the explorative testing of the ZooKeeper implementation. The formal specifications help eliminate the ambiguities in the protocol design and provide comprehensive system documentation. They also help find critical deep bugs in system implementation, which are beyond the reach of state-of-the-art testing techniques. Our specifications have been merged into the official Apache ZooKeeper project.

This work is supported by the National Natural Science Foundation of China (62372222), the CCF-Huawei Populus Grove Fund (CCF-HuaweiFM202304), the Cooperation Fund of Huawei-Nanjing University Next Generation Programming Innovation Lab (YBN2019105178SW38), the Fundamental Research Funds for the Central Universities (020214912222) and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 8464
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 10581
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    Super-doc refers to the precise, concise and testable documentation of the system implementation, which can be explored and experimented on with tools [37].

References

  1. Apache ZooKeeper’s issue tracking system. https://issues.apache.org/jira/projects/ZOOKEEPER/issues

  2. Issue: ZK-2845. https://issues.apache.org/jira/browse/ZOOKEEPER-2845

  3. Issue: ZK-3615. https://issues.apache.org/jira/browse/ZOOKEEPER-3615

  4. Issue: ZK-3911. https://issues.apache.org/jira/browse/ZOOKEEPER-3911

  5. MET. https://github.com/Lingzhi-Ouyang/MET

  6. Three levels of TLA\(^+\) specifications for ZooKeeper. https://github.com/Disalg-ICS-NJU/zookeeper-tla-spec

  7. TLA\(^+\) home page. https://lamport.azurewebsites.net/tla/tla.html

  8. TLA\(^+\) specification for Paxos. https://github.com/tlaplus/Examples/blob/master/specifications/PaxosHowToWinATuringAward/Paxos.tla

  9. TLA\(^+\) specification for PaxosStore. https://github.com/Starydark/PaxosStore-tla

  10. TLA\(^+\) specification for Raft. https://github.com/ongardie/raft.tla

  11. TLA\(^+\) specifications for the Apache ZooKeeper project. https://github.com/apache/zookeeper/tree/master/zookeeper-specifications

  12. The use of TLA\(^+\) in industry. https://lamport.azurewebsites.net/tla/industrial-use.html

  13. Zab’s wiki. https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0

  14. ZooKeeper home page. https://zookeeper.apache.org/

  15. Bourque, P., Fairley, R.E., Society, I.C.: Guide to the Software Engineering Body of Knowledge (SWEBOK(R)): Version 3.0, 3rd edn. IEEE Computer Society Press, Washington, DC, USA (2014)

    Google Scholar 

  16. Davis, A.J.J., Hirschhorn, M., Schvimer, J.: eXtreme modelling in practice. Proc. VLDB Endow. 13(9), 1346–1358 (2020). https://doi.org/10.14778/3397230.3397233

    Article  Google Scholar 

  17. Dorminey, S.: Kayfabe: model-based program testing with TLA\(^+\)/TLC. Technical report, Microsoft Azure WAN (2020). https://conf.tlapl.us/2020/11-Star_Dorminey-Kayfabe_Model_based_program_testing_with_TLC.pdf

  18. Gafni, E., Lamport, L.: Disk Paxos. Distrib. Comput. 16(1), 1–20 (2003). https://doi.org/10.1007/s00446-002-0070-8

    Article  Google Scholar 

  19. Gu, X., Cao, W., Zhu, Y., Song, X., Huang, Y., Ma, X.: Compositional model checking of consensus protocols specified in TLA\(^+\) via interaction-preserving abstraction. In: Proceedings of International Symposium on Reliable Distributed Systems (SRDS 2022). IEEE (2022). https://doi.org/10.1109/srds55811.2022.00018

  20. Gu, X., Wei, H., Qiao, L., Huang, Y.: Raft with out-of-order executions. Int. J. Softw. Informatics 11(4), 473–503 (2021). https://doi.org/10.21655/ijsi.1673-7288.00257

    Article  Google Scholar 

  21. Gunawi, H.S., et al.: FATE and DESTINI: a framework for cloud recovery testing. In: Proceedings of the 8th USENIX conference on Networked Systems Design and Implementation. p. 239 (2011). https://dl.acm.org/doi/10.5555/1972457.1972482

  22. Huang, B., Ouyang, L.: Pull request for ZOOKEEPER-3615: provide formal specification and verification using TLA\(^+\) for Zab #1690. https://github.com/apache/zookeeper/pull/1690

  23. Hunt, P., Konar, M., Junqueira, F.P., Reed, B.: ZooKeeper: wait-free coordination for internet-scale systems. In: Proceedings of ATC 2010, USENIX Annual Technical Conference, pp. 145–158. USENIX (2010). https://dl.acm.org/doi/10.5555/1855840.1855851

  24. Junqueira, F.P., Reed, B.C., Serafini, M.: Dissecting Zab. Technical report, YL-2010-007, Yahoo! Research, Sunnyvale, CA, USA (2010). https://cwiki.apache.org/confluence/download/attachments/24193444/yl-2010-007.pdf

  25. Junqueira, F.P., Reed, B.C., Serafini, M.: Zab: high-performance broadcast for primary-backup systems. In: Proceedings of DSN 2011, IEEE/IFIP Conference on Dependable Systems and Networks, pp. 245–256. IEEE (2011). https://doi.org/10.1109/DSN.2011.5958223

  26. Kim, B.H., Kim, T., Lie, D.: Modulo: finding convergence failure bugs in distributed systems with divergence resync models. https://www.usenix.org/conference/atc22/presentation/kim-beom-heyn

  27. Kuprianov, A., Konnov, I.: Model-based testing with TLA\(^+\) and Apalache. Technical report, Informal Systems (2020). https://conf.tlapl.us/2020/09-Kuprianov_and_Konnov-Model-based_testing_with_TLA_+_and_Apalache.pdf

  28. Lamport, L.: Fast Paxos. Distrib. Comput. 19(2), 79–103 (2006). https://doi.org/10.1007/s00446-006-0005-x

    Article  MathSciNet  Google Scholar 

  29. Lamport, L.: The PlusCal Code for Byzantizing Paxos by Refinement. TechReport, Microsoft Research (2011)

    Google Scholar 

  30. Lamport, L., Merz, S., D, D.: A TLA\(^+\) specification of Paxos and its refinement (2019). https://github.com/tlaplus/Examples/tree/master/specifications/Paxos

  31. Leesatapornwongsa, T., Gunawi, H.S.: SAMC: a fast model checker for finding Heisenbugs in distributed systems (demo). In: Proceedings of the 2015 International Symposium on Software Testing and Analysis, ISSTA 2015, pp. 423–427. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2771783.2784771

  32. Leesatapornwongsa, T., Hao, M., Joshi, P., Lukman, J.F., Gunawi, H.S.: SAMC: semantic-aware model checking for fast discovery of deep bugs in cloud systems. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI 2014, pp. 399–414. USENIX Association, Berkeley (2014). https://dl.acm.org/doi/10.5555/2685048.2685080

  33. Lukman, J.F., et al.: Flymc: highly scalable testing of complex interleavings in distributed systems. In: Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1–16 (2019). https://doi.org/10.1145/3302424.3303986

  34. Marić, O., Sprenger, C., Basin, D.: Cutoff bounds for consensus algorithms. In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol. 10427, pp. 217–237. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63390-9_12

    Chapter  Google Scholar 

  35. Medeiros, A.: ZooKeeper’s atomic broadcast protocol: theory and practice (2012). https://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf

  36. Moraru, I., Andersen, D.G., Kaminsky, M.: There is more consensus in egalitarian parliaments. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 358–372 (2013). https://doi.org/10.1145/2517349.2517350

  37. Newcombe, C., Rath, T., Zhang, F., Munteanu, B., Brooker, M., Deardeuff, M.: How amazon web services uses formal methods. Commun. ACM 58(4), 66–73 (2015). https://doi.org/10.1145/2699417

    Article  Google Scholar 

  38. Ongaro, D., Ousterhout, J.: In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC 2014, pp. 305–320. USENIX Association, Berkeley (2014). https://dl.acm.org/doi/10.5555/2643634.2643666

  39. Ozkan, B.K., Majumdar, R., Niksic, F., Befrouei, M.T., Weissenbacher, G.: Randomized testing of distributed systems with probabilistic guarantees. Proc. ACM Program. Lang. 2(OOPSLA), 1–28 (2018). https://doi.org/10.1145/3276530

  40. Ozkan, B.K., Majumdar, R., Oraee, S.: Trace aware random testing for distributed systems. Proc. ACM Program. Lang. 3(OOPSLA), 1–29 (2019). https://doi.org/10.1145/3360606

  41. Wang, D., Dou, W., Gao, Y., Wu, C., Wei, J., Huang, T.: Model checking guided testing for distributed systems. In: Proceedings of the Eighteenth European Conference on Computer Systems, pp. 127–143 (2023). https://doi.org/10.1145/3552326.3587442

  42. Yin, J.Q., Zhu, H.B., Fei, Y.: Specification and verification of the Zab protocol with TLA\(^+\). J. Comput. Sci. Technol. 35, 1312–1323 (2020). https://doi.org/10.1007/s11390-020-0538-7

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yu Huang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Ouyang, L., Huang, Y., Huang, B., Ma, X. (2024). Leveraging TLA\(^+\) Specifications to Improve the Reliability of the ZooKeeperCoordination Service. In: Hermanns, H., Sun, J., Bu, L. (eds) Dependable Software Engineering. Theories, Tools, and Applications. SETTA 2023. Lecture Notes in Computer Science, vol 14464. Springer, Singapore. https://doi.org/10.1007/978-981-99-8664-4_11

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8664-4_11

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8663-7

  • Online ISBN: 978-981-99-8664-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics