Abstract
A key-value store is an essential component that is increasingly in demand in many scale-out environments, including social networks, online retail, and cloud services. Modern key-value storage engines provide many features, including transactions, versioning, and replication. In these engines, transaction processing provides atomicity and durability by using write-ahead logging (WAL), which flushes log data to persistent storage before the corresponding data page is written in a synchronous commit. However, according to our observations, WAL is a performance bottleneck in key-value storage engines: flushing log data to persistent storage incurs significant overhead from lock contention and fsync() calls, even with the various optimizations in the existing scheme. In this article, we propose an approach that improves the performance of key-value storage by optimizing the existing flushing scheme with group commit and a consolidation array. Our scheme aggregates multiple flushes of log data into a single large request on the fly and completes the request early. It is thus an efficient group commit that reduces the number of lock acquisitions and fsync() calls in the synchronous commit path while supporting the same transaction level that the existing scheme provides. Furthermore, we integrate our flushing scheme into the replication system and evaluate it using multiple nodes. We implement our scheme in the WiredTiger storage engine. The experimental results show that our scheme improves the performance of key-value workloads compared to the existing scheme.
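The core idea of the abstract can be illustrated with a minimal sketch: per-commit WAL issues one fsync() per transaction, whereas group commit lets transactions join an in-flight group and lets one flush cover all of them. This is an illustrative model, not WiredTiger's actual API; the capacity, function names, and record size are assumptions.

```c
#include <stddef.h>

/* Hypothetical group size; in practice the group is whatever joins
 * while the previous flush is in flight. */
#define GROUP_CAPACITY 8

struct log_group {
    size_t size;   /* bytes of joined log records in this group */
    int    joined; /* number of transactions that have joined   */
};

/* A committing transaction appends its log record to the current
 * group instead of flushing alone. */
static void group_join(struct log_group *g, size_t record_size)
{
    g->size += record_size;
    g->joined++;
}

/* The leader flushes the whole group with one (simulated) fsync(). */
static void group_flush(struct log_group *g, int *fsync_calls)
{
    (*fsync_calls)++; /* one fsync() durably commits every joined txn */
    g->size = 0;
    g->joined = 0;
}

/* Baseline synchronous WAL: one fsync() per committed transaction. */
int fsyncs_per_commit(int txns) { return txns; }

/* Group commit: transactions batch into groups before flushing. */
int fsyncs_group_commit(int txns)
{
    struct log_group g = {0, 0};
    int fsyncs = 0;
    for (int i = 0; i < txns; i++) {
        group_join(&g, 128); /* illustrative 128-byte log record */
        if (g.joined == GROUP_CAPACITY)
            group_flush(&g, &fsyncs);
    }
    if (g.joined > 0) /* flush any partial final group */
        group_flush(&g, &fsyncs);
    return fsyncs;
}
```

With 64 committing transactions, the baseline performs 64 fsync() calls while the grouped scheme performs only 8, which is the source of the reduced lock and fsync() overhead the abstract describes.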
Notes
A leader is the first thread to acquire a log buffer.
The offset is the slot's own LSN.
The group size is the size of the total joined log records in the slot.
The completion flag denotes whether the LSN is completed or not.
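The four fields described in the notes above (leader, offset/LSN, group size, completion flag) can be sketched as a consolidation-array slot. The layout and names below are illustrative assumptions, not the actual WiredTiger structure.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical consolidation-array slot, following the fields
 * described in the notes; field names are illustrative. */
struct slot {
    uint64_t lsn;        /* offset: the slot's own LSN in the log     */
    size_t   group_size; /* total size of the joined log records      */
    int      leader_tid; /* the first thread to acquire the log buffer */
    int      completed;  /* completion flag: set once the LSN is durable */
};

/* A thread joins the slot by adding its record size to the group.
 * The first thread to join an empty slot becomes the leader and will
 * later perform the flush on behalf of the whole group. */
static int slot_join(struct slot *s, int tid, size_t rec_size)
{
    int is_leader = (s->group_size == 0);
    if (is_leader)
        s->leader_tid = tid;
    s->group_size += rec_size;
    return is_leader;
}
```

Followers that fail the leadership check simply wait for the completion flag, which is what removes their individual lock acquisitions and fsync() calls from the commit path.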
Replication can be broadly classified into two categories: synchronous and asynchronous. In synchronous replication, transactions are committed simultaneously on all nodes; the master and the slave always remain synchronized, so the data is guaranteed to be consistent on all nodes when a transaction commits. In asynchronous replication, transactions are committed at the master first and then replicated to the slave, which means the master and slave may not be consistent. The advantage of asynchronous replication is that it is faster and scales better than synchronous replication; however, the data is not guaranteed to be consistent on all nodes.
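The difference between the two modes can be modeled in a few lines: a synchronous commit returns only after the slave has applied the transaction, while an asynchronous commit returns immediately and the slave catches up later. This is a toy model with illustrative names, not a real replication protocol.

```c
/* Toy model of one master and one slave, counting applied txns. */
struct replica {
    int master_txns; /* transactions committed on the master */
    int slave_txns;  /* transactions applied on the slave    */
};

/* Synchronous replication: the commit does not return until the
 * slave has applied the transaction, so both copies always match. */
void commit_sync(struct replica *r)
{
    r->master_txns++;
    r->slave_txns++;
}

/* Asynchronous replication: the master acknowledges immediately;
 * the slave lags until background replication ships the log. */
void commit_async(struct replica *r)
{
    r->master_txns++;
}

/* Background replication eventually applies the backlog. */
void async_catch_up(struct replica *r)
{
    r->slave_txns = r->master_txns;
}
```

After a synchronous commit the two counters are always equal; after an asynchronous commit they diverge until catch-up runs, which is exactly the consistency trade-off described above.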
References
Aguilera, M.K., Leners, J.B., Walfish, M.: Yesquel: scalable sql storage for web applications. In: Proceedings of the 25th Symposium on Operating Systems Principles, ACM, pp. 245–262 (2015)
Arulraj, J., Perron, M., Pavlo, A.: Write-behind logging. Proc. VLDB Endow. 10(4), 337–348 (2016)
Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., Paleczny, M.: Workload analysis of a large-scale key-value store. ACM SIGMETRICS Perform. Eval. Rev. 40, 53–64 (2012)
Banker, K.: MongoDB in Action. Manning Publications Co., Greenwich (2011)
Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading (1987)
Carlson, J.L.: Redis in Action. Manning Publications Co., Greenwich (2013)
Chen, S.: Flashlogging: exploiting flash devices for synchronous logging performance. In: SIGMOD, New York, NY, USA, SIGMOD’09, ACM, pp. 73–86 (2009)
Amazon Elastic Compute Cloud (EC2): Amazon Web Services (2011). Accessed 9 November 2011
Felber, P., Pasin, M., Rivière, É., Schiavoni, V., Sutra, P., Coelho, F., Oliveira, R., Matos, M., Vilaça, R.: On the support of versioning in distributed key-value stores. In: 2014 IEEE 33rd International Symposium on Reliable Distributed Systems (SRDS), IEEE, pp. 95–104 (2014)
Fitzpatrick, B.: Distributed caching with memcached. Linux J. 2004(124), 5 (2004)
Fruhwirt, P., Kieseberg, P., Schrittwieser, S., Huber, M., Weippl, E.: Innodb database forensics: reconstructing data manipulation queries from redo logs. In: 2012 Seventh International Conference on Availability, Reliability and Security (ARES) (2012)
Gao, S., Xu, J., He, B., Choi, B., Hu, H.: Pcmlogging: reducing transaction logging overhead with pcm. In: 20th ACM International Conference on Information and Knowledge Management, New York, NY, USA, CIKM’11, ACM, pp. 2401–2404 (2011)
Goel, S., Buyya, R.: Data replication strategies in wide-area distributed systems. In: Enterprise Service Computing: From Concept to Deployment. IGI Global, pp. 211–241 (2007)
Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques. Elsevier, San Francisco (1992)
Han, J., Haihong, E., Le, G., Du, J.: Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications (ICPCA), IEEE, pp. 363–366 (2011)
Helland, P., et al.: Group commit timers and high volume transaction systems. In: High Performance Transaction Systems. Springer, New York, pp. 301–329 (1989)
Huang, J., Schwan, K., Qureshi, M.K.: Nvram-aware logging in transaction systems. Proc. VLDB Endow. 8(4), 389–400 (2014)
Johnson, R., Pandis, I., Stoica, R., Athanassoulis, M., Ailamaki, A.: Aether: a scalable approach to logging. Proc. VLDB Endow. 3(1–2), 681–692 (2010)
Kang, W.-H., Lee, S.-W., Moon, B., Kee, Y.-S., Oh, M.: Durable write cache in flash memory ssd for relational and nosql databases. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14 (2014)
Kang, W.-H., Lee, S.-W., Moon, B., Oh, G.-H., Min, C.: X-FTL: transactional FTL for SQLite databases. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, SIGMOD’13, ACM, pp. 97–108 (2013)
Kopytov, A.: Sysbench: a system performance benchmark. http://sysbench.sourceforge.net (2004)
Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44(2), 35–40 (2010)
Lee, S.-W., Moon, B., Park, C., Kim, J.-M., Kim, S.-W.: A case for flash memory ssd in enterprise database applications. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, pp. 1075–1086 (2008)
Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., Vivier, L.: The new ext4 filesystem: current status and future plans. In: Ottawa Linux Symposium. http://ols.108.redhat.com/2007/Reprints/mathur-Reprint.pdf (2007)
Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P.: Aries: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans. Database Syst. (TODS) 17(1), 94–162 (1992)
MongoDB: https://www.mongodb.com/press/wired-tiger (2014)
NVM express: http://www.nvmexpress.org (2012)
Oh, G., Seo, C., Mayuram, R., Kee, Y.-S., Lee, S.-W.: SHARE interface in flash storage for relational and NoSQL databases. In: Proceedings of the 2016 International Conference on Management of Data, New York, NY, USA, SIGMOD’16, ACM, pp. 343–354 (2016)
Ouyang, X., Nellans, D., Wipfel, R., Flynn, D., Panda, D.K.: Beyond block I/O: rethinking traditional storage primitives. In: 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), IEEE, pp. 301–311 (2011)
Ramakrishnan, R., Gehrke, J.: Database Management Systems. Osborne/McGraw-Hill, Berkeley (2000)
SAMSUNG 843Tn Data Center Series: http://www.samsung.com/semiconductor/global/file/insight/2015/08/PSG2014_2H_FINAL-1.pdf
Samsung: XS1715 Ultra-fast Enterprise Class. http://www.samsung.com/global/business/semiconductor/file/product/XS1715_ProdOverview_2014_1.pdf (2014)
Sivasubramanian, S.: Amazon dynamodb: a seamlessly scalable non-relational database service. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ACM, pp. 729–730 (2012)
Son, Y., Kang, H., Han, H., Yeom, H.Y.: An empirical evaluation and analysis of the performance of nvm express solid state drive. Cluster Comput. 19, 1–13 (2016)
Son, Y., Kang, H., Han, H., Yeom, H.Y.: Improving performance of cloud key-value storage using flushing optimization. In: 2016 IEEE 1st International Workshops on Foundations and Applications of Self* Systems (FAS*W), pp. 42–47 (2016)
Son, Y., Yeom, H., Han, H.: Optimizing i/o operations in file systems for fast storage devices. IEEE Trans. Comput. 66, 1071–1084 (2016)
Song, N.Y., Son, Y., Han, H., Yeom, H.Y.: Efficient memory-mapped I/O on fast storage device. ACM Trans. Storage 12(4), 19:1–19:27 (2016)
Sumbaly, R., Kreps, J., Gao, L., Feinberg, A., Soman, C., Shah, S.: Serving large-scale batch computed data with project voldemort. In: Proceedings of the 10th USENIX Conference on File and Storage Technologies, USENIX Association, p. 18 (2012)
Wang, T., Johnson, R.: Scalable logging through emerging non-volatile memory. Proc. VLDB Endow. 7(10), 865–876 (2014)
WiredTiger: http://www.wiredtiger.com (2014)
Acknowledgements
This research was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2015M3C4A7065581, 2015M3C4A7065645). Prof. Han’s work was partly supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2014R1A1A2055032).
Additional information
A preliminary version [35] of this article was presented at the 1st IEEE International Workshops on Foundations and Applications of Self* Systems, Augsburg, Germany, September 2016.
About this article
Cite this article
Son, Y., Yeom, H.Y. & Han, H. Design and implementation of an efficient flushing scheme for cloud key-value storage. Cluster Comput 20, 3551–3563 (2017). https://doi.org/10.1007/s10586-017-1101-3