Abstract
Large scale distributed databases are designed to support commercial and cloud based applications. The minimal expectation from such systems is that they ensure consistency and reliability in case of node failures. The distributed database guarantees reliability through the use of atomic commitment protocols. Atomic commitment protocols help in ensuring that either all the changes of a transaction are applied or none of them exist. To ensure efficient commitment process, the database community has mainly used the two-phase commit (2PC) protocol. However, the 2PC protocol is blocking under multiple failures. This necessitated the development of non-blocking, three-phase commit (3PC) protocol. However, the database community is still reluctant to use the 3PC protocol, as it acts as a scalability bottleneck in the design of efficient transaction processing systems. In this work, we present EasyCommit protocol which leverages the best of both worlds (2PC and 3PC), that is non-blocking (like 3PC) and requires two phases (like 2PC). EasyCommit achieves these goals by ensuring two key observations: (i) first transmit and then commit, and (ii) message redundancy. We present the design of the EasyCommit protocol and prove that it guarantees both safety and liveness. We also present a detailed evaluation of EC protocol and show that it is nearly as efficient as the 2PC protocol. To cater the needs of geographically large scale distributed systems we also design a topology-aware agreement protocol (Geo-scale EasyCommit) that is non-blocking, safe, live and outperforms 3PC protocol.


























Similar content being viewed by others
Notes
The coordinating node is the one which initiates the commit protocol, and in this work it is also the node which receives the client request to execute a transaction.
Partitioned database is the terminology used by the database community to refer to the shared-nothing distributed databases, and should not be intermixed with the term network partitioning.
The term cohort refers to a participating node in the transaction commit process. We use these terms interchangeably.
INITAL, READY and WAIT states are considered as non-committable states.
Without node failures, any transaction that reaches the prepare phase is assumed to successfully commit.
References
Abbadi, A.E., Toueg, S.: Maintaining availability in partitioned replicated databases. ACM Trans Database Syst 14(2), 264–290 (1989). https://doi.org/10.1145/63500.63501
Abdallah, M., Guerraoui, R., Pucheral, P.: One-phase commit: does it make sense? ICPADS (1998)
Agrawal, D., El Abbadi, A., Mahmoud, H.A., Nawab, F., Salem, K.: Managing geo-replicated data in multi-datacenters. In: Proceedings of the 2013 Databases in Networked Information Systems—8th International Workshop, DNIS’13, pp. 23–43 (2013)
Amir, Y., Danilov, C., Dolev, D., Kirsch, J., Lane, J., Nita-Rotaru, C., Olsen, J., Zage, D.: Steward: scaling byzantine fault-tolerant replication to wide area networks. IEEE Trans. Dependable Secur. Comput. 7(1), 80–93 (2010). https://doi.org/10.1109/TDSC.2008.53
Bailis, P., Davidson, A., Fekete, A., Ghodsi, A., Hellerstein, J.M., Stoica, I.: Highly available transactions: virtues and limitations. Proc VLDB Endow 7(3), 181–192 (2013)
Bailis, P., Fekete, A., Ghodsi, A., Hellerstein, J.M., Stoica, I.: Scalable atomic visibility with RAMP transactions. ACM Trans Database Syst 41(3), 15 (2016)
Baker, J., Bond, C., Corbett, J.C., Furman, J., Khorlin, A., Larson, J., Leon, J.M., Li, Y., Lloyd, A., Yushprakh, V.: Megastore: providing scalable, highly available storage for interactive services. In: Proceedings of the Conference on Innovative Data system Research (CIDR), pp. 223–234 (2011)
Bernstein, P.A., Goodman, N.: Concurrency control in distributed database systems. ACM Comput Surv 13(2), 185–221 (1981)
Bernstein, P.A., Goodman, N.: Multiversion concurrency control—theory and algorithms. ACM TODS 8(4), 465–483 (1983)
Bernstein, P.A., Goodman, N.: An algorithm for concurrency control and recovery in replicated distributed databases. ACM Trans Database Syst 9(4), 596–615 (1984). https://doi.org/10.1145/1994.2207
Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA (1987a)
Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman Publishing Co., Boston, MA (1987b)
Boutros, B.S., Desai, B.C.: A two-phase commit protocol and its performance. In: IEEE, DEXA, pp. 100–105 (1996)
Chen, K., Zhou, Y., Cao, Y.: Online data partitioning in distributed database systems. In: Proceedings of the 18th International Conference on Extending Database Technology, OpenProceeding.org, pp. 1–12 (2015)
CockroachDB (2018). https://www.cockroachlabs.com/
Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., Sears, R.: Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM Symposium on Cloud Computing, ACM, pp. 143–154 (2010)
Corbett, J.C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Saito, Y., Szymaniak, M., Taylor, C., Wang, R., Woodford, D.: Spanner: Google’s globally-distributed database. In: 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), USENIX Association, pp. 261–264 (2012)
Council TPP (2010) Tpc benchmark c (revision 5.11)
Diaconu, C., Freedman, C., Ismert, E., Larson, P.A., Mittal, P., Stonecipher, R., Verma, N., Zwilling, M.: Hekaton: SQL Server’s Memory-optimized OLTP Engine. ACM, pp. 1243–1254 (2013)
Dutta, P., Guerraoui, R., Pochon, B.: Fast non-blocking atomic commit: an inherent trade-off. Inf Process Lett 91(4), 195–200 (2004)
El Abbadi, A., Skeen, D., Cristian, F.: An efficient, fault-tolerant protocol for replicated data management. In: Proceedings of the Fourth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, ACM, New York, PODS ’85, pp 215–229 (1985). https://doi.org/10.1145/325405.325443
Freels, M.: FaunaDB: an architectural overview (2018)
Fung, B.: The embarrassing reason behind Amazons huge cloud computing outage this week. The Washington Post, Washington, DC (2017)
Gawlick, D., Kinkade, D.: Varieties of concurrency control in IMS/VS fast path. IEEE Database Eng. Bull. 8, 3–10 (1985)
Gifford, D.K.: Weighted voting for replicated data. In: Proceedings of the Seventh ACM Symposium on Operating Systems Principles, ACM, New York, NY, SOSP ’79, pp 150–162 (1979). https://doi.org/10.1145/800215.806583
Gray, J.: Notes on data base operating systems. In: Operating Systems, An Advanced Course. Springer, Berlin, pp. 393–481 (1978)
Gray, J.: The transaction concept: virtues and limitations (invited paper). In: VLDB, pp. 144–154 (1981)
Gray, J.: A Comparison of the Byzantine Agreement Problem and the Transaction Commit Problem, pp. 10–17. Springer, New York (1990)
Gray, J., Lamport, L.: Consens. Trans. Commit. ACM TODS 31(1), 133–160 (2006)
Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques, 1st edn. Morgan Kaufmann Publishers Inc., Burlington (1992)
Guerraoui, R.: Revisiting the Relationship Between Non-blocking Atomic Commitment and Consensus, pp. 87–100. Springer, Berlin (1995)
Guerraoui, R., Larrea, M., Schiper, A.: Reducing the Cost for Non-blocking in Atomic Commitment. In: IEEE Proceedings of 16th International Conference on Distributed Computing Systems, pp. 692–697 (1996)
Gupta, S., Sadoghi, M.: Blockchain Transaction Processing, pp. 1–11. Springer, Cham (2018a)
Gupta, S., Sadoghi, M.: EasyCommit: A non-blocking two-phase commit protocol. In: Proceedings of the 21st International Conference on Extending Database Technology, Open Proceedings, EDBT (2018b)
Harding, R., Van Aken, D., Pavlo, A., Stonebraker, M.: An evaluation of distributed concurrency control. Proc VLDB Endow 10(5), 553–564 (2017)
Haritsa, J.R., Ramamritham, K., Gupta, R.: The PROMPT real-time commit protocol. IEEE TPDS 11(2), 160–181 (2000)
Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM TOPLAS 12(3), 463–492 (1990)
Jiménez-Peris, R., Patiño Martínez, M., Alonso, G., Arévalo, S.: A low-latency non-blocking commit service. Springer, Berlin DISC’01 (2001)
Kallman, R., Kimura, H., Natkins, J., Pavlo, A., Rasin, A., Zdonik, S.B., Jones, E.P.C., Madden, S., Stonebraker, M., Zhang, Y., Hugg, J., Abadi, D.J.: H-store: a high-performance, distributed main memory transaction processing system. PVLDB 1, 1496–1499 (2008)
Lamport, L.: The part-time parliament. ACM Trans Comput Syst 16(2), 133–169 (1998)
Levy, E., Korth, H.F., Silberschatz, A.: An optimistic commit protocol for distributed transaction management. In: ACM SIGMOD, ACM, pp. 88–97 (1991)
Lin, Q., Chang, P., Chen, G., Ooi, B.C., Tan, K.L., Wang, Z.: Towards a non-2PC transaction management in distributed database systems. In: Proceedings of the 2016 International Conference on Management of Data, ACM, New York, NY, SIGMOD ’16, pp 1659–1674 (2016). https://doi.org/10.1145/2882903.2882923
Lloyd, W., Freedman, M.J., Kaminsky, M., Andersen, D.G.: Stronger semantics for low-latency geo-replicated storage. In: USENIX Association, NSDI, pp. 313–328 (2013)
Mahmoud, H., Nawab, F., Pucher, A., Agrawal, D., El Abbadi, A.: Low-latency multi-datacenter databases using replicated commit. Proc VLDB Endow 6(9), 661–672 (2013). https://doi.org/10.14778/2536360.2536366
Mahmoud, H.A., Arora, V., Nawab, F., Agrawal, D., El Abbadi, A.: MaaT: effective and scalable coordination of distributed transactions in the cloud. Proc VLDB Endow 7(5), 329–340 (2014). https://doi.org/10.14778/2732269.2732270
Mao, Y., Junqueira, F.P., Marzullo, K.: Mencius: building efficient replicated state machines for WANs. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, pp. 369–384 (2008)
MemSQL (2013). http://www.memsql.com
Mohan, C., Lindsay, B., Obermarck, R.: Transaction management in the R* distributed database management system. ACM TODS 11(4), 378–396 (1986)
Nawab, F., Sadoghi, M.: Blockplane: A global-scale byzantizing middleware. In: Proceedings of the 35th IEEE International Conference on Data Engineering, IEEE, ICDE ’19 (2019)
Nawab, F., Arora, V., Agrawal, D., El Abbadi, A.: Minimizing commit latency of transactions in geo-replicated data stores. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ACM, SIGMOD ’15, pp 1279–1294 (2015)
NuoDB (2010). http://www.nuodb.com
O’Brien, S.A.: Facebook. Instagram experience outages Saturday. CNN, GA, USA (2017)
Ongaro, D., Ousterhout, J.: In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX Association, USENIX ATC’14, pp. 305–320 (2014)
Oracle, C.: Oracle 9i real application clusters concepts release 2 (9.2), Part Number A96597-01 (2002)
Ozsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 3rd edn. Springer, New York (2011)
Park, T., Yeom, H.Y.: A distributed group commit protocol for distributed database systems. ICPADS (1991)
Patterson, S., Elmore, A.J., Nawab, F., Agrawal, D., El Abbadi, A.: Serializability, not serial: concurrency control and availability in multi-datacenter datastores. Proc VLDB Endow 5(11), (2012)
Pavlo, A., Curino, C., Zdonik, S.: Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In: ACM, SIGMOD ’12, pp. 61–72 (2012)
Peng, D., Dabek, F.: Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, Berkeley, CA, OSDI’10, pp. 251–264 (2010)
Qadah, T.M., Sadoghi, M.: QueCC: a queue-oriented, control-free concurrency architecture. In: Proceedings of the 19th International Middleware Conference, ACM, New York, NY, Middleware ’18, pp 13–25, (2018). https://doi.org/10.1145/3274808.3274810
Reddy, P.K., Kitsuregawa, M.: Reducing the blocking in two-phase commit protocol employing backup sites. In: IEEE, COOPIS’98, pp. 406–416 (1998)
Sadoghi, M., Blanas, S.: Transaction processing on modern hardware. Synth. Lect. Data Manag. 14(2), 1–138 (2019). https://doi.org/10.2200/S00896ED1V01Y201901DTM058
Sadoghi, M., Ross, K.A., Canim, M., Bhattacharjee, B.: Making updates disk-I/O friendly using SSDs. Proc VLDB Endow 6(11), 997–1008 (2013)
Sadoghi, M., Canim, M., Bhattacharjee, B., Nagel, F., Ross, K.A.: Reducing database locking contention through multi-version concurrency. Proc VLDB Endow 7(13), 1331–1342 (2014)
Sadoghi, M., Bhattacherjee, S., Bhattacharjee, B., Canim, M.: L-Store: A real-time OLTP and OLAP system (2018). http://www.OpenProceeding.org, EDBT
Samaras, G., Britton, K., Citron, A., Mohan, C.: Two-phase commit optimizations in a commercial distributed environment. Distrib. Parallel Databases 3(4), 325–360 (1995)
Shute, J., Vingralek, R., Samwel, B., Handy, B., Whipkey, C., Rollins, E., Oancea, M., Littleeld, K., Menestrina, D., Ellner, S., Apte, H.: F1: A distributed sql database that scales. In: VLDB (2013)
Skeen, D.: Nonblocking commit protocols. In: ACM, SIGMOD, pp. 133–142 (1981)
Skeen, D.: A quorum-based commit protocol. Tech. rep. (1982)
Skeen, D., Stonebraker, M.: A formal model of crash recovery in a distributed system. IEEE Trans. Softw. Eng. 9(3), 219–228 (1983)
Stamos, J., Cristian, F.: A low-cost atomic commit protocol. In: Proceedings of the 9th Symposium on Reliable Distributed Systems, IEEE, pp. 10–17 (1990)
Stonebraker, M.: Concurrency control and consistency of multiple copies of data in distributed ingres. IEEE Trans. Softw. Eng. SE–5(3), 188–194 (1979). https://doi.org/10.1109/TSE.1979.234180
Stonebraker, M.: The case for shared nothing. Database Eng. 9, 4–9 (1986)
Sulleyman, A.: Twitter down: social media app and website not working. The Independent, UK (2017)
Thomson, A., Diamond, T., Weng, S.C., Ren, K., Shao, P., Abadi, D.J.: Calvin: fast distributed transactions for partitioned database systems. In: SIGMOD (2012)
TiDB (2018). https://pingcap.com/en/
VoltDB (2010). https://www.voltdb.com/
Acknowledgements
We would like to acknowledge Thamir Qadah for the valuable discussions that helped us to design ExpoDB system. Further, we acknowledge the anonymous reviewers for their useful inputs and comments.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Gupta, S., Sadoghi, M. Efficient and non-blocking agreement protocols. Distrib Parallel Databases 38, 287–333 (2020). https://doi.org/10.1007/s10619-019-07267-w
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10619-019-07267-w