Abstract
Demand for data analytics at Twitter has increased steadily in recent years. To meet this demand and provide a highly scalable and available query experience, Twitter relies heavily on a large-scale in-house SQL system. Recently, we evolved this SQL system into a hybrid-cloud SQL federation system, in line with Twitter’s Partly Cloudy strategy. The hybrid-cloud SQL federation system can process queries across Twitter’s data centers and the public cloud, interacting with around 10 PB of data per day.
This paper presents the design of the hybrid-cloud SQL federation system, which consists of query, cluster, and storage federations. We identify the challenges facing a modern SQL system and show how the key design decisions in our system address them. We also conduct qualitative examinations and summarize instructive lessons learned from developing and operating such a SQL system.
Notes
1. At Twitter, SQL system users cannot create or update datasets, except for temporary tables exclusive to their personal accounts. Because of data lineage and governance requirements, only data pipeline system accounts have write access to public datasets. Individual SQL users have read access only.
2. In an analysis of a typical Twitter OLAP workload over three months, 19.2% of queries consumed more than 1 TB of peak memory.
Acknowledgment
Twitter’s SQL federation system is a complicated project that has evolved for years. We would like to express our gratitude to everyone who has served on Twitter’s Interactive Query team, including former team members Hao Luo, Yaliang Wang, Da Cheng, Fred Dai, and Maosong Fu. We also appreciate Prateek Mukhedkar, Vrushali Channapattan, Daniel Lipkin, Derek Lyon, Srikanth Thiagarajan, Jeremy Zogg, and Sudhir Srinivas for their strategic vision, direction, and support to the team. Finally, we thank Erica Hessel, Alex Angarita Rosales, and the anonymous ECSA reviewers for their informative comments, which considerably improved our paper.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tang, C. et al. (2022). Serving Hybrid-Cloud SQL Interactive Queries at Twitter. In: Scandurra, P., Galster, M., Mirandola, R., Weyns, D. (eds) Software Architecture. ECSA 2021. Lecture Notes in Computer Science, vol 13365. Springer, Cham. https://doi.org/10.1007/978-3-031-15116-3_1
DOI: https://doi.org/10.1007/978-3-031-15116-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15115-6
Online ISBN: 978-3-031-15116-3
eBook Packages: Computer Science, Computer Science (R0)