Abstract

This paper argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse, which will (i) be based on open direct-access data formats, such as Apache Parquet, (ii) have first class support for machine learning and data science, and (iii) offer state-of-the-art performance. Lakehouses can help address several major challenges with data warehouses, including data staleness, reliability, total cost of ownership, data lock-in, and limited use-case support. We discuss how the industry is already moving toward Lakehouses and how this shift may affect work in data management. We also report results from a Lakehouse system using Parquet that is competitive with popular cloud data warehouses on TPC-DS.

Introduction

This paper argues that the data warehouse architecture as we know it today will wane in the coming years and be replaced by a new architectural pattern, which we refer to as the Lakehouse, characterized by (i) open direct-access data formats, such as Apache Parquet and ORC, (ii) first-class support for machine learning and data science workloads, and (iii) state-of-the-art performance.

The history of data warehousing started with helping business leaders get analytical insights by collecting data from operational databases into centralized warehouses, which then could be used for decision support and business intelligence (BI). Data in these warehouses would be written with schema-on-write, which ensured that the data model was optimized for downstream BI consumption. We refer to these as the first-generation data analytics platforms.

A decade ago, the first generation systems started to face several challenges. First, they typically coupled compute and storage into an on-premises appliance. This forced enterprises to provision and pay for the peak of user load and data under management, which became very costly as datasets grew. Second, not only were datasets growing rapidly, but more and more datasets were completely unstructured, e.g., video, audio, and text documents, which data warehouses could not store and query at all.

To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9]. This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage. The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand, punted the problem of data quality and governance downstream. In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications. The use of open formats also made data lake data directly accessible to a wide range of other analytics engines, such as machine learning systems [30, 37, 42].

From 2015 onwards, cloud data lakes, such as S3, ADLS and GCS, started replacing HDFS. They have superior durability (often >10 nines), geo-replication, and most importantly, extremely low cost with the possibility of automatic, even cheaper, archival storage, e.g., AWS Glacier. The rest of the architecture is largely the same in the cloud as in the second generation systems, with a downstream data warehouse such as Redshift or Snowflake. This two-tier data lake + warehouse architecture is now dominant in the industry in our experience (used at virtually all Fortune 500 enterprises).

This brings us to the challenges with current data architectures. While the cloud data lake and warehouse architecture is ostensibly cheap due to separate storage (e.g., S3) and compute (e.g., Redshift), a two-tier architecture is highly complex for users. In the first generation platforms, all data was ETLed from operational data systems directly into a warehouse. In today’s architectures, data is first ETLed into lakes, and then again ELTed into warehouses, creating complexity, delays, and new failure modes. Moreover, enterprise use cases now include advanced analytics such as machine learning, for which neither data lakes nor warehouses are ideal. Specifically, today’s data architectures commonly suffer from four problems:

Reliability. Keeping the data lake and warehouse consistent is difficult and costly. Continuous engineering is required to ETL data between the two systems and make it available to high-performance decision support and BI. Each ETL step also risks incurring failures or introducing bugs that reduce data quality, e.g., due to subtle differences between the data lake and warehouse engines.

Data staleness. The data in the warehouse is stale compared to that of the data lake, with new data frequently taking days to load. This is a step back compared to the first generation of analytics systems, where new operational data was immediately available for queries. According to a survey by Dimensional Research and Fivetran, 86% of analysts use out-of-date data and 62% report waiting on engineering resources numerous times per month [47].

Limited support for advanced analytics. Businesses want to ask predictive questions using their warehousing data, e.g., “which customers should I offer discounts to?” Despite much research on the confluence of ML and data management, none of the leading machine learning systems, such as TensorFlow, PyTorch and XGBoost, work well on top of warehouses. Unlike BI queries, which extract a small amount of data, these systems need to process large datasets using complex non-SQL code. Reading this data via ODBC/JDBC is inefficient, and there is no way to directly access the internal warehouse proprietary formats. For these use cases, warehouse vendors recommend exporting data to files, which further increases complexity and staleness (adding a third ETL step!). Alternatively, users can run these systems against data lake data in open formats. However, they then lose rich management features from data warehouses, such as ACID transactions, data versioning and indexing.

Total cost of ownership. Apart from paying for continuous ETL, users pay double the storage cost for data copied to a warehouse, and commercial warehouses lock data into proprietary formats that increase the cost of migrating data or workloads to other systems.

A straw-man solution that has had limited adoption is to eliminate the data lake altogether and store all the data in a warehouse that has built-in separation of compute and storage. We will argue that this has limited viability, as evidenced by lack of adoption, because it still does not easily support managing video/audio/text data or fast, direct access from ML and data science workloads.

In this paper, we discuss the following technical question: is it possible to turn data lakes based on standard open data formats, such as Parquet and ORC, into high-performance systems that can provide both the performance and management features of data warehouses and fast, direct I/O from advanced analytics workloads? We argue that this type of system design, which we refer to as a Lakehouse (Fig. 1), is both feasible and is already showing evidence of success, in various forms, in the industry. As more business applications start relying on operational data and on advanced analytics, we believe the Lakehouse is a compelling design point that can eliminate some of the top challenges with data warehousing.

In particular, we believe that the time for the Lakehouse has come due to recent solutions that address the following key problems:

  1. Reliable data management on data lakes: A Lakehouse needs to be able to store raw data, similar to today’s data lakes, while simultaneously supporting ETL/ELT processes that curate this data to improve its quality for analysis. Traditionally, data lakes have managed data as “just a bunch of files” in semi-structured formats, making it hard to offer some of the key management features that simplify ETL/ELT in data warehouses, such as transactions, rollbacks to old table versions, and zero-copy cloning. However, a recent family of systems such as Delta Lake [10] and Apache Iceberg [7] provide transactional views of a data lake, and enable these management features. Of course, organizations still have to do the hard work of writing ETL/ELT logic to create curated datasets with a Lakehouse, but there are fewer ETL steps overall, and analysts can also easily and performantly query the raw data tables if they wish to, much like in first-generation analytics platforms.
  2. Support for machine learning and data science: ML systems’ support for direct reads from data lake formats already places them in a good position to efficiently access a Lakehouse. In addition, many ML systems have adopted DataFrames as the abstraction for manipulating data, and recent systems have designed declarative DataFrame APIs [11] that enable performing query optimizations for data accesses in ML workloads. These APIs enable ML workloads to directly benefit from many optimizations in Lakehouses.
  3. SQL performance: Lakehouses will need to provide state-of-the-art SQL performance on top of the massive Parquet/ORC datasets that have been amassed over the last decade (or in the long term, some other standard format that is exposed for direct access to applications). In contrast, classic data warehouses accept SQL and are free to optimize everything under the hood, including proprietary storage formats. Nonetheless, we show that a variety of techniques can be used to maintain auxiliary data about Parquet/ORC datasets and to optimize data layout within these existing formats to achieve competitive performance. We present results from a SQL engine over Parquet (the Databricks Delta Engine [19]) that outperforms leading cloud data warehouses on TPC-DS.

In the rest of the paper, we detail the motivation, potential technical designs, and research implications of Lakehouse platforms.

Motivation: Data Warehousing Challenges

Data warehouses are critical for many business processes, but they still regularly frustrate users with incorrect data, staleness, and high costs. We argue that at least part of each of these challenges is “accidental complexity” [18] from the way enterprise data platforms are designed, which could be eliminated with a Lakehouse.

First, the top problem reported by enterprise data users today is usually data quality and reliability [47, 48]. Implementing correct data pipelines is intrinsically difficult, but today’s two-tier data architectures with a separate lake and warehouse add extra complexity that exacerbates this problem. For example, the data lake and warehouse systems might have different semantics in their supported data types, SQL dialects, etc.; data may be stored with different schemas in the lake and the warehouse (e.g., denormalized in one); and the increased number of ETL/ELT jobs, spanning multiple systems, increases the probability of failures and bugs.

Second, more and more business applications require up-to-date data, but today’s architectures increase data staleness by having a separate staging area for incoming data before the warehouse and using periodic ETL/ELT jobs to load it. Theoretically, organizations could implement more streaming pipelines to update the data warehouse faster, but these are still harder to operate than batch jobs. In contrast, in the first-generation platforms, warehouse users had immediate access to raw data loaded from operational systems in the same environment as derived datasets. Business applications such as customer support systems and recommendation engines are simply ineffective with stale data, and even human analysts querying warehouses report stale data as a major problem [47].

Third, a large fraction of data is now unstructured in many industries [22] as organizations collect images, sensor data, documents, etc. Organizations need easy-to-use systems to manage this data, but SQL data warehouses and their APIs do not easily support it.

Finally, most organizations are now deploying machine learning and data science applications, but these are not well served by data warehouses and lakes. As discussed before, these applications need to process large amounts of data with non-SQL code, so they cannot run efficiently over ODBC/JDBC. As advanced analytics systems continue to develop, we believe that giving them direct access to data in an open format will be the most effective way to support them. In addition, ML and data science applications suffer from the same data management problems that classical applications do, such as data quality, consistency, and isolation [17, 27, 31], so there is immense value in bringing DBMS features to their data.

Existing steps towards Lakehouses. Several current industry trends give further evidence that customers are dissatisfied with the two-tier lake + warehouse model. First, in recent years, virtually all the major data warehouses have added support for external tables in Parquet and ORC format [12, 14, 43, 46]. This allows warehouse users to also query the data lake from the same SQL engine, but it does not make data lake tables easier to manage and it does not remove the ETL complexity, staleness, and advanced analytics challenges for data in the warehouse. In practice, these connectors also often perform poorly because the SQL engine is mostly optimized for its internal data format. Second, there is also broad investment in SQL engines that run directly against data lake storage, such as Spark SQL, Presto, Hive, and AWS Athena [3, 11, 45, 50]. However, these engines alone cannot solve all the problems with data lakes and replace warehouses: data lakes still lack basic management features such as ACID transactions and efficient access methods such as indexes to match data warehouse performance.

The Lakehouse Architecture

We define a Lakehouse as a data management system based on low cost and directly-accessible storage that also provides traditional analytical DBMS management and performance features such as ACID transactions, data versioning, auditing, indexing, caching, and query optimization. Lakehouses thus combine the key benefits of data lakes and data warehouses: low-cost storage in an open format accessible by a variety of systems from the former, and powerful management and optimization features from the latter. The key question is whether one can combine these benefits in an effective way: in particular, Lakehouses’ support for direct access means that they give up some aspects of data independence, which has been a cornerstone of relational DBMS design.

We note that Lakehouses are an especially good fit for cloud environments with separate compute and storage: different computing applications can run on-demand on completely separate computing nodes (e.g., a GPU cluster for ML) while directly accessing the same storage data. However, one could also implement a Lakehouse over an on-premises storage system such as HDFS.

In this section, we sketch one possible design for Lakehouse systems, based on three recent technical ideas that have appeared in various forms throughout the industry. We have been building towards a Lakehouse platform based on this design at Databricks through the Delta Lake, Delta Engine and Databricks ML Runtime projects [10, 19, 38]. Other designs may also be viable, however, as are other concrete technical choices in our high-level design (e.g., our stack at Databricks currently builds on the Parquet storage format, but it is possible to design a better format). We discuss several alternatives and future directions for research.

Implementing a Lakehouse System

The first key idea we propose for implementing a Lakehouse is to have the system store data in a low-cost object store (e.g., Amazon S3) using a standard file format such as Apache Parquet, but implement a transactional metadata layer on top of the object store that defines which objects are part of a table version. This allows the system to implement management features such as ACID transactions or versioning within the metadata layer, while keeping the bulk of the data in the low-cost object store and allowing clients to directly read objects from this store using a standard file format in most cases. Several recent systems, including Delta Lake [10] and Apache Iceberg [7], have successfully added management features to data lakes in this fashion; for example, Delta Lake is now used in about half of Databricks’ workload, by thousands of customers.
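
To make this concrete, the following minimal sketch (plain Python with an illustrative file layout; not Delta Lake’s or Iceberg’s actual log format) shows the core idea: an ordered log of commit files records which Parquet objects were added or removed, so any table version is simply the result of replaying the log up to a given commit, and readers never observe a partially written set of files.

    import json
    from pathlib import Path

    def current_files(table_path: str) -> set:
        """Replay an ordered log of JSON commits (00000.json, 00001.json, ...)
        to compute which Parquet objects belong to the latest table version."""
        files = set()
        for commit_file in sorted((Path(table_path) / "_log").glob("*.json")):
            for action in json.loads(commit_file.read_text()):
                if action["op"] == "add":
                    files.add(action["path"])      # object becomes part of the table
                else:                              # "remove": object is logically deleted
                    files.discard(action["path"])
        return files

    def commit(table_path: str, actions: list) -> None:
        """Append the next commit file. On a real object store, creating the next
        sequential log entry must be atomic (e.g., a put-if-absent primitive or a
        coordination service) for the table to get ACID semantics."""
        log_dir = Path(table_path) / "_log"
        log_dir.mkdir(parents=True, exist_ok=True)
        version = len(list(log_dir.glob("*.json")))
        target = log_dir / f"{version:05d}.json"
        if target.exists():
            raise RuntimeError("lost the race to a concurrent writer; retry")
        target.write_text(json.dumps(actions))

A query engine then scans only the Parquet objects returned by current_files, so a failed or in-progress write never changes what other readers see.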

Although a metadata layer adds management capabilities, it is not sufficient to achieve good SQL performance. Data warehouses use several techniques to get state-of-the-art performance, such as storing hot data on fast devices such as SSDs, maintaining statistics, building efficient access methods such as indexes, and co-optimizing the data format and compute engine. In a Lakehouse based on existing storage formats, it is not possible to change the format, but we show that it is possible to implement other optimizations that leave the data files unchanged, including caching, auxiliary data structures such as indexes and statistics, and data layout optimizations.

Finally, Lakehouses can both speed up advanced analytics workloads and give them better data management features thanks to the development of declarative DataFrame APIs [11, 37]. Many ML libraries, such as TensorFlow and Spark MLlib, can already read data lake file formats such as Parquet [30, 37, 42]. Thus, the simplest way to integrate them with a Lakehouse would be to query the metadata layer to figure out which Parquet files are currently part of a table, and simply pass those to the ML library. However, most of these systems support a DataFrame API for data preparation that creates more optimization opportunities. DataFrames were popularized by R and Pandas [40] and simply give users a table abstraction with various transformation operators, most of which map to relational algebra. Systems such as Spark SQL have made this API declarative by lazily evaluating the transformations and passing the resulting operator plan to an optimizer [11]. These APIs can thus leverage the new optimization features in a Lakehouse, such as caches and auxiliary data, to further accelerate ML.

Figure 2 shows how these ideas fit together into a Lakehouse system design. In the next three sections, we expand on these technical ideas in more detail and discuss related research questions.

Metadata Layers for Data Management

The first component that we believe will enable Lakehouses is metadata layers over data lake storage that can raise its abstraction level to implement ACID transactions and other management features. Data lake storage systems such as S3 or HDFS only provide a low-level object store or filesystem interface where even simple operations, such as updating a table that spans multiple files, are not atomic. Organizations soon began designing richer data management layers over these systems, starting with Apache Hive ACID [33], which tracks which data files are part of a Hive table at a given table version using an OLTP DBMS and allows operations to update this set transactionally. In recent years, new systems have provided even more capabilities and improved scalability. In 2016, Databricks began developing Delta Lake [10], which stores the information about which objects are part of a table in the data lake itself as a transaction log in Parquet format, enabling it to scale to billions of objects per table. Apache Iceberg [7], which started at Netflix, uses a similar design and supports both Parquet and ORC storage. Apache Hudi [6], which started at Uber, is another system in this area focused on simplifying streaming ingest into data lakes, although it does not support concurrent writers.

Experience with these systems has shown that they generally provide performance similar to or better than raw Parquet/ORC data lakes, while adding highly useful management features such as transactions, zero-copy cloning and time travel to past versions of a table [10]. In addition, they are easy to adopt for organizations that already have a data lake: for example, Delta Lake can convert an existing directory of Parquet files into a Delta Lake table with zero copies just by adding a transaction log that starts with an entry that references all the existing files. As a result, organizations are rapidly adopting these metadata layers: for example, Delta Lake grew to cover half the compute-hours on Databricks in three years.
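
For illustration, the PySpark snippet below (table path is hypothetical; assumes the open-source delta-spark package is installed) sketches the zero-copy conversion and time travel described above:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    # SparkSession configured with the open-source Delta Lake extensions.
    spark = (SparkSession.builder
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Convert an existing directory of Parquet files into a Delta table in place:
    # only a transaction log referencing the existing files is written; no data is copied.
    DeltaTable.convertToDelta(spark, "parquet.`s3://bucket/events`")

    # Read the current version of the table ...
    current = spark.read.format("delta").load("s3://bucket/events")

    # ... or "time travel" to an earlier version for auditing or reproducibility.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("s3://bucket/events")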

In addition, metadata layers are a natural place to implement data quality enforcement features. For example, Delta Lake implements schema enforcement to ensure that the data uploaded to a table matches its schema, and a constraints API [24] that allows table owners to set constraints on the ingested data (e.g., country can only be one of a list of values). Delta’s client libraries will automatically reject records that violate these expectations or quarantine them in a special location. Customers have found these simple features very useful to improve the quality of data lake based pipelines.
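
As a small illustration (table and column names are hypothetical, and exact syntax varies by Delta Lake release), a table owner can declare such an expectation so that violating writes are rejected at ingest time:

    # Attach a CHECK constraint to an existing Delta table; later writes whose
    # rows violate it fail instead of silently degrading data quality.
    spark.sql("""
        ALTER TABLE events
        ADD CONSTRAINT valid_country CHECK (country IN ('US', 'CA', 'GB'))
    """)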

Finally, metadata layers are a natural place to implement governance features such as access control and audit logging. For example, a metadata layer can check whether a client is allowed to access a table before granting it credentials to read the raw data in the table from a cloud object store, and can reliably log all accesses.

Future Directions and Alternative Designs. Because metadata layers for data lakes are a fairly new development, there are many open questions and alternative designs. For example, we designed Delta Lake to store its transaction log in the same object store that it runs over (e.g., S3) in order to simplify management (removing the need to run a separate storage system) and offer high availability and high read bandwidth to the log (the same as the object store). However, this limits the rate of transactions/second it can support due to object stores’ high latency. A design using a faster storage system for the metadata may be preferable in some cases. Likewise, Delta Lake, Iceberg and Hudi only support transactions on one table at a time, but it should be possible to extend them to support cross-table transactions. Optimizing the format of transaction logs and the size of objects managed are also open questions.

SQL Performance in a Lakehouse

Perhaps the largest technical question with the Lakehouse approach is how to provide state-of-the-art SQL performance while giving up a significant portion of the data independence in a traditional DBMS design. The answer clearly depends on a number of factors, such as what hardware resources we have available (e.g., can we implement a caching layer on top of the object store) and whether we can change the data object storage format instead of using existing standards such as Parquet and ORC (new designs that improve over these formats continue to emerge [15, 28]). Regardless of the exact design, however, the core challenge is that the data storage format becomes part of the system’s public API to allow fast direct access, unlike in a traditional DBMS.

We propose several techniques to implement SQL performance optimizations in a Lakehouse independent of the chosen data format, which can therefore be applied either with existing or future formats. We have also implemented these techniques within the Databricks Delta Engine [19] and show that they yield competitive performance with popular cloud data warehouses, though there is plenty of room for further performance optimizations. These format-independent optimizations are:
Caching: When using a transactional metadata layer such as Delta Lake, it is safe for a Lakehouse system to cache files from the cloud object store on faster storage devices such as SSDs and RAM on the processing nodes. Running transactions can easily determine when cached files are still valid to read. Moreover, the cache can be in a transcoded format that is more efficient for the query engine to run on, matching any optimizations that would be used in a traditional “closed-world” data warehouse engine. For example, our cache at Databricks partially decompresses the Parquet data it loads.

Auxiliary data: Even though a Lakehouse needs to expose the base table storage format for direct I/O, it can maintain other data that helps optimize queries in auxiliary files that it has full control over. In Delta Lake and Delta Engine, we maintain column min-max statistics for each data file in the table within the same Parquet file used to store the transaction log, which enables data skipping optimizations when the base data is clustered by particular columns. We are also implementing a Bloom filter based index. One can imagine implementing a wide range of auxiliary data structures here, similar to proposals for indexing “raw” data [1, 2, 34].
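
The sketch below (illustrative Python, not the Delta Engine implementation) shows how such per-file min/max statistics enable data skipping: for a range predicate, only files whose column range overlaps the predicate are ever read from the object store.

    def files_to_scan(file_stats, column, lo, hi):
        """file_stats: per-file statistics kept in the metadata layer, e.g.
        {"path": "part-0001.parquet", "min": {"date": "2020-03-01"}, "max": {"date": "2020-03-31"}}.
        Return only the files whose [min, max] range for `column` overlaps [lo, hi]."""
        selected = []
        for f in file_stats:
            if f["max"][column] >= lo and f["min"][column] <= hi:
                selected.append(f["path"])    # ranges overlap: file may contain matches
        return selected                       # every other file is skipped with zero I/O

    # A query with WHERE date BETWEEN '2020-03-01' AND '2020-03-31' then touches only
    # the files whose date range overlaps March 2020.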

Data layout: Data layout plays a large role in access performance. Even when we fix a storage format such as Parquet, there are multiple layout decisions that can be optimized by the Lakehouse system. The most obvious is record ordering: which records are clustered together and hence easiest to read together. In Delta Lake, we support ordering records using individual dimensions or space filling curves such as Z-order [39] and Hilbert curves to provide locality across multiple dimensions. One can also imagine new formats that support placing columns in different orders within each data file, choosing compression strategies differently for various groups of records, or other strategies [28].
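
As a concrete example of one such layout choice, the sketch below computes a Z-order (Morton) key by interleaving the bits of two integer columns; sorting records by this key before writing keeps both columns’ min/max ranges narrow within each file, so the data skipping above works for filters on either dimension. (Simplified; real implementations handle arbitrary types and more dimensions, and the column names are illustrative.)

    def z_order_key(x: int, y: int, bits: int = 32) -> int:
        """Interleave the bits of x and y (a Morton code), so that records that are
        close in both dimensions end up clustered into the same data files."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
            key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
        return key

    # e.g., sort records by z_order_key(user_id, date_as_int) before writing Parquet files.
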
These three optimizations work especially well together for the typical access patterns in analytical systems. In typical workloads, most queries tend to be concentrated against a “hot” subset of the data, which the Lakehouse can cache using the same optimized data structures as a closed-world data warehouse to provide competitive performance. For “cold” data in the cloud object store, the main determinant of performance is likely to be the amount of data read per query. In that case, the combination of data layout optimizations (which cluster co-accessed data) and auxiliary data structures such as zone maps (which let the engine rapidly figure out what ranges of the data files to read) can allow a Lakehouse system to minimize I/O the same way a closed-world proprietary data warehouse would, despite running against a standard open file format.

Performance Results. At Databricks, we combined these three Lakehouse optimizations with a new C++ execution engine for Apache Spark called Delta Engine [19]. To evaluate the feasibility of the Lakehouse architecture, Figure 3 compares Delta Engine on TPC-DS at scale factor 30,000 with four widely used cloud data warehouses (from cloud providers as well as third-party companies that run over public clouds), using comparable clusters on AWS, Azure and Google Cloud with 960 vCPUs each and local SSD storage. We report the time to run all 99 queries as well as the total cost for customers in each service’s pricing model (Databricks lets users choose spot and on-demand instances, so we show both). Delta Engine provides comparable or better performance than these systems at a lower price point.

Future Directions and Alternative Designs. Designing performant yet directly-accessible Lakehouse systems is a rich area for future work. One clear direction that we have not explored yet is designing new data lake storage formats that will work better in this use case, e.g., formats that provide more flexibility for the Lakehouse system to implement data layout optimizations or indexes over them, or that are simply better suited to modern hardware. Of course, such new formats may take a while for processing engines to adopt, limiting the number of clients that can read from them, but designing a high quality directly-accessible open format for next generation workloads is an important research problem.

Even without changing the data format, there are many types of caching strategies, auxiliary data structures and data layout strategies to explore for Lakehouses [4, 49, 53]. Determining which ones are likely to be most effective for massive datasets in cloud object stores is an open question.

Finally, another exciting research direction is determining when and how to use serverless computing systems to answer queries [41] and optimizing the storage, metadata layer, and query engine designs to minimize latency in this case.

Efficient Access for Advanced Analytics

As we discussed earlier in the paper, advanced analytics libraries are usually written using imperative code that cannot run as SQL, yet they need to access large amounts of data. There is an interesting research question in how to design the data access layers in these libraries to maximize flexibility for the code running on top but still benefit from optimization opportunities in a Lakehouse.

One approach that we’ve had success with is offering a declarative version of the DataFrame APIs used in these libraries, which maps data preparation computations into Spark SQL query plans and can benefit from the optimizations in Delta Lake and Delta Engine. We used this approach in both Spark DataFrames [11] and in Koalas [35], a new DataFrame API for Spark that offers improved compatibility with Pandas. DataFrames are the main data type used to pass input into the ecosystem of advanced analytics libraries for Apache Spark, including MLlib [37], GraphFrames [21], SparkR [51] and many community libraries, so all of these workloads can enjoy accelerated I/O if we can optimize the DataFrame computation. Spark’s query planner pushes selections and projections in the user’s DataFrame computation directly into the “data source” plugin class for each data source read. Thus, in our implementation of the Delta Lake data source, we leverage the caching, data skipping and data layout optimizations described in Section 3.3 to accelerate these reads from Delta Lake and thus accelerate ML and data science workloads, as illustrated in Figure 4.
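
For instance, in the hedged PySpark sketch below (illustrative path and column names; assumes a SparkSession named spark configured with Delta Lake support, as sketched earlier), the filter and projection of a data preparation pipeline are planned by Spark SQL and pushed into the Delta Lake data source, so data skipping and caching apply before any rows reach the ML library:

    from pyspark.ml.feature import VectorAssembler

    # Lazily build a declarative DataFrame plan; nothing is read yet.
    events = (spark.read.format("delta").load("s3://bucket/events")
              .filter("event_date >= '2020-01-01'")          # pushed down: enables data skipping
              .select("user_id", "feature_a", "feature_b"))  # prunes unneeded Parquet columns

    # Only when the ML library consumes the DataFrame does the optimized scan run.
    assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
    features = assembler.transform(events)
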
Machine learning APIs are quickly evolving, however, and there are also other data access APIs, such as TensorFlow’s tf.data, that do not attempt to push query semantics into the underlying storage system. Many of these APIs also focus on overlapping data loading on the CPU with CPU-to-GPU transfers and GPU computation, which has not received much attention in data warehouses. Recent systems work has shown that keeping modern accelerators well utilized, especially for ML inference, can be a difficult problem [44], so Lakehouse access libraries will need to tackle this challenge.

Future Directions and Alternative Designs. Apart from the questions about existing APIs and efficiency that we have just discussed, we can explore radically different designs for data access interfaces for ML. For example, recent work has proposed “factorized ML” frameworks that push ML logic into SQL joins, and other query optimizations that can be applied for ML algorithms implemented in SQL [36]. Finally, we still need standard interfaces to let data scientists take full advantage of the powerful data management capabilities in Lakehouses (or even data warehouses). For example, at Databricks, we have integrated Delta Lake with the ML experiment tracking service in MLflow [52] to let data scientists easily track the table versions used in an experiment and reproduce that version of the data later. There is also an emerging abstraction of feature stores in the industry as a data management layer to store and update the features used in an ML application [26, 27, 31], which would benefit from using the standard DBMS functions in a Lakehouse design, such as transactions and data versioning.
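
A hedged sketch of this kind of integration (illustrative path; the actual Databricks integration differs in detail) is to record the Delta table version an experiment trained on with MLflow, then reload exactly that snapshot later:

    import mlflow
    from delta.tables import DeltaTable

    table_path = "s3://bucket/features"
    version = DeltaTable.forPath(spark, table_path).history(1).collect()[0]["version"]

    with mlflow.start_run():
        mlflow.log_param("data_path", table_path)
        mlflow.log_param("data_version", version)   # enough to reproduce the exact training data
        # ... train and log the model ...

    # Later: reload the data exactly as it was when the experiment ran.
    snapshot = (spark.read.format("delta")
                .option("versionAsOf", version).load(table_path))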

Research Questions and Implications

Beyond the research challenges that we raised as future directions in Sections 3.2–3.4, Lakehouses raise several other research questions. In addition, the industry trend toward increasingly feature-rich data lakes has implications for other areas of data systems research.
Are there other ways to achieve the Lakehouse goals? One can imagine other means to achieve the primary goals of the Lakehouse, such as building a massively parallel serving layer for a data warehouse that can support parallel reads from advanced analytics workloads. However, we believe that such infrastructure will be significantly more expensive to run, harder to manage, and likely less performant than giving workloads direct access to the object store. We have not seen broad deployment of systems that add this type of serving layer, such as Hive LLAP [32]. Moreover, this approach punts the problem of selecting an efficient data format for reads to the serving layer, and this format still needs to be easy to transcode from the warehouse’s internal format. The main draws of cloud object stores are their low cost, high bandwidth access from elastic workloads, and extremely high availability; all three get worse with a separate serving layer in front of the object store.

Beyond the performance, availability, cost and lock-in challenges with these alternate approaches, there are also important governance reasons why enterprises may prefer to keep their data in an open format. With increasing regulatory requirements about data management, organizations may need to search through old datasets, delete various data, or change their data processing infrastructure on short notice, and standardizing on an open format means that they will always have direct access to the data without blocking on a vendor. The long-term trend in the software industry has been towards open data formats, and we believe that this trend will continue for enterprise data.

What are the right storage formats and access APIs? The access interface to a Lakehouse includes the raw storage format, client libraries to directly read this format (e.g., when reading into TensorFlow), and a high-level SQL interface. There are many different ways to place rich functionality across these layers, such as storage schemes that provide more flexibility to the system by asking readers to perform more sophisticated, “programmable” decoding logic [28]. It remains to be seen which combination of storage formats, metadata layer designs, and access APIs works best.

How does the Lakehouse affect other data management research and trends? The prevalence of data lakes and the increasing use of rich management interfaces over them, whether they be metadata layers or the full Lakehouse design, has implications for several other areas of data management research.

Polystores were designed to solve the difficult problem of querying data across disparate storage engines [25]. This problem will persist in enterprises, but the increasing fraction of data that is available in an open format in a cloud data lake means that many polystore queries could be answered by running directly against the cloud object store, even if the underlying data files are part of logically separate Lakehouse deployments.

Data integration and cleaning tools can also be designed to run in place over a Lakehouse with fast parallel access to all the data, which may enable new algorithms such as running large joins and clustering algorithms over many of the datasets in an organization.

HTAP systems could perhaps be built as “bolt-on” layers in front of a Lakehouse by archiving data directly into a Lakehouse system using its transaction management APIs. The Lakehouse would be able to query consistent snapshots of the data.

Data management for ML may also become simpler and more powerful if implemented over a Lakehouse. Today, organizations are building a wide range of ML-specific data versioning and “feature store” systems [26, 27, 31] that reimplement standard DBMS functionality. It might be simpler to just use a data lake abstraction with DBMS management functions built-in to implement feature store functionality. At the same time, declarative ML systems such as factorized ML [36] could likely run well against a Lakehouse.

Cloud-native DBMS designs such as serverless engines [41] will need to integrate with richer metadata management layers such as Delta Lake instead of just scanning over raw files in a data lake, but may be able to achieve increased performance.

Finally, there is ongoing discussion in the industry about how to organize data engineering processes and teams, with concepts such as the “data mesh” [23], where separate teams own different data products end-to-end, gaining popularity over the traditional “central data team” approach. Lakehouse designs lend themselves easily to distributed collaboration structures because all datasets are directly accessible from an object store without having to onboard users on the same compute resources, making it straightforward to share data regardless of which teams produce and consume it.

Related Work

The Lakehouse approach builds on many research efforts to design data management systems for the cloud, starting with early work to use S3 as a block store in a DBMS [16] and to “bolt-on” consistency over cloud object stores [13]. It also builds heavily on research to accelerate query processing by building auxiliary data structures around a fixed data format [1, 2, 34, 53].

The most closely related systems are “cloud-native” data warehouses backed by separate storage [20, 29] and data lake systems like Apache Hive [50]. Cloud-native warehouses such as Snowflake and BigQuery [20, 29] have seen good commercial success, but they are still not the primary data store in most large organizations: the majority of data continues to be in data lakes, which can easily store the time-series, text, image, audio and semi-structured formats that high-volume enterprise data arrives in. As a result, cloud warehouse systems have all added support to read external tables in data lake formats [12, 14, 43, 46]. However, these systems cannot provide any management features over the data in data lakes (e.g., implement ACID transactions over it) the same way they do for their internal data, so using them with data lakes continues to be difficult and error-prone. Data warehouses are also not a good fit for large-scale ML and data science workloads due to the inefficiency in streaming data out of them compared to direct object store access.

On the other hand, while early data lake systems purposefully cut down the feature set of a relational DBMS for ease of implementation, the trend in all these systems has been to add ACID support [33] and increasingly rich management and performance features [6, 7, 10]. In this paper, we extrapolate this trend to discuss what technical designs may allow Lakehouse systems to completely replace data warehouses, show quantitative results from a new query engine optimized for a Lakehouse, and sketch some significant research questions and design alternatives in this domain.

Conclusion

We have argued that a unified data platform architecture that implements data warehousing functionality over open data lake file formats can provide competitive performance with today’s data warehouse systems and help address many of the challenges facing data warehouse users. Although constraining a data warehouse’s storage layer to open, directly-accessible files in a standard format appears like a significant limitation at first, optimizations such as caching for hot data and data layout optimization for cold data can allow Lakehouse systems to achieve competitive performance. We believe that the industry is likely to converge towards Lakehouse designs given the vast amounts of data already in data lakes and the opportunity to greatly simplify enterprise data architectures.

Acknowledgements
We thank the Delta Engine, Delta Lake, and Benchmarking teams at Databricks for their contributions to the results we discuss in this work. Awez Syed, Alex Behm, Greg Rahn, Mostafa Mokhtar, Peter Boncz, Bharath Gowda, Joel Minnick and Bart Samwel provided valuable feedback on the ideas in this paper. We also thank the CIDR reviewers for their feedback.

References

[1] I. Alagiannis, R. Borovica-Gajic, M. Branco, S. Idreos, and A. Ailamaki. NoDB: Efficient query execution on raw data files. CACM, 58(12):112–121, Nov. 2015.
[2] I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: a hands-free adaptive store. In SIGMOD, 2014.
[3] Amazon Athena. https://aws.amazon.com/athena/.
[4] G. Ananthanarayanan, A. Ghodsi, A. Warfield, D. Borthakur, S. Kandula, S. Shenker, and I. Stoica. PACMan: Coordinated memory caching for parallel jobs. In NSDI, pages 267–280, 2012.
[5] Apache Hadoop. https://hadoop.apache.org.
[6] Apache Hudi. https://hudi.apache.org.
[7] Apache Iceberg. https://iceberg.apache.org.
[8] Apache ORC. https://orc.apache.org.
[9] Apache Parquet. https://parquet.apache.org.
[10] M. Armbrust, T. Das, L. Sun, B. Yavuz, S. Zhu, M. Murthy, J. Torres, H. van Hovell, A. Ionescu, A. Łuszczak, M. Świtakowski, M. Szafrański, X. Li, T. Ueshin, M. Mokhtar, P. Boncz, A. Ghodsi, S. Paranjpye, P. Senster, R. Xin, and M. Zaharia. Delta Lake: High-performance ACID table storage over cloud object stores. In VLDB, 2020.
[11] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015.
[12] Azure Synapse: Create external file format. https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql.
[13] P. Bailis, A. Ghodsi, J. Hellerstein, and I. Stoica. Bolt-on causal consistency. In SIGMOD, pages 761–772, 2013.
[14] BigQuery: Creating a table definition file for an external data source. https://cloud.google.com/bigquery/external-table-definition, 2020.
[15] P. Boncz, T. Neumann, and V. Leis. FSST: Fast random access string compression. In VLDB, 2020.
[16] M. Brantner, D. Florescu, D. Graf, D. Kossmann, and T. Kraska. Building a database on S3. In SIGMOD, pages 251–264, 2008.
[17] E. Breck, M. Zinkevich, N. Polyzotis, S. Whang, and S. Roy. Data validation for machine learning. In SysML, 2019.
[18] F. Brooks, Jr. No silver bullet – essence and accidents of software engineering. IEEE Computer, 20:10–19, April 1987.
[19] A. Conway and J. Minnick. Introducing Delta Engine. https://databricks.com/blog/2020/06/24/introducing-delta-engine.html.
[20] B. Dageville, J. Huang, A. Lee, A. Motivala, A. Munir, S. Pelley, P. Povinec, G. Rahn, S. Triantafyllis, P. Unterbrunner, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, and M. Hentschel. The Snowflake elastic data warehouse. In SIGMOD, pages 215–226, 2016.
[21] A. Dave, A. Jindal, L. E. Li, R. Xin, J. Gonzalez, and M. Zaharia. GraphFrames: An integrated API for mixing graph and relational queries. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, GRADES ’16, New York, NY, USA, 2016. Association for Computing Machinery
[22] D. Davis. AI unleashes the power of unstructured data. https://www.cio.com/article/3406806/, 2019.
[23] Z. Dehghani. How to move beyond a monolithic data lake to a distributed data mesh. https://martinfowler.com/articles/data-monolith-to-mesh.html, 2019.
[24] Delta Lake constraints. https://docs.databricks.com/delta/delta-constraints.html, 2020.
[25] J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. Zdonik. The BigDAWG polystore system. SIGMOD Rec., 44(2):11–16, Aug. 2015.
[26] Data Version Control (DVC). https://dvc.org.
[27] Feast: Feature store for machine learning. https://feast.dev, 2020.
[28] B. Ghita, D. G. Tomé, and P. A. Boncz. White-box compression: Learning and exploiting compact table representations. In CIDR. www.cidrdb.org, 2020.
[29] Google BigQuery. https://cloud.google.com/bigquery
[30] Getting data into your H2O cluster. https://docs.h2o.ai/h2o/latest-stable/h2odocs/getting-data-into-h2o.html, 2020.
[31] K. Hammar and J. Dowling. Feature store: The missing data layer in ML pipelines? https://www.logicalclocks.com/blog/feature-store-the-missing-datalayer-in-ml-pipelines, 2018.
[32] Hive LLAP. https://cwiki.apache.org/confluence/display/Hive/LLAP, 2020.
[33] Hive ACID documentation. https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-hiveql/content/hive_3_internals.html.
[34] S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki. Here are my data files. here are my queries. where are my results? In CIDR, 2011.
[35] koalas library. https://github.com/databricks/koalas, 2020.
[36] S. Li, L. Chen, and A. Kumar. Enabling and optimizing non-linear feature interactions in factorized linear algebra. In SIGMOD, page 1571–1588, 2019.
[37] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. J. Mach. Learn. Res., 17(1):1235–1241, Jan. 2016.
[38] Databricks ML runtime. https://databricks.com/product/machine-learning-runtime.
[39] G. M. Morton. A computer oriented geodetic data base; and a new technique in file sequencing. IBM Technical Report, 1966.
[40] pandas Python data analysis library. https://pandas.pydata.org, 2017.
[41] M. Perron, R. Castro Fernandez, D. DeWitt, and S. Madden. Starling: A scalable query engine on cloud functions. In SIGMOD, page 131–141, 2020.
[42] Petastorm. https://github.com/uber/petastorm.
[43] Redshift CREATE EXTERNAL TABLE. https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html, 2020.
[44] D. Richins, D. Doshi, M. Blackmore, A. Thulaseedharan Nair, N. Pathapati, A. Patel, B. Daguman, D. Dobrijalowski, R. Illikkal, K. Long, D. Zimmerman, and V. Janapa Reddi. Missing the forest for the trees: End-to-end ai application performance in edge data centers. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 515–528, 2020.
[45] R. Sethi, M. Traverso, D. Sundstrom, D. Phillips, W. Xie, Y. Sun, N. Yegitbasi, H. Jin, E. Hwang, N. Shingte, and C. Berner. Presto: SQL on everything. In ICDE, pages 1802–1813, April 2019.
[46] Snowflake CREATE EXTERNAL TABLE. https://docs.snowflake.com/en/sqlreference/sql/create-external-table.html, 2020.
[47] Fivetran data analyst survey. https://fivetran.com/blog/analyst-survey, 2020.
[48] M. Stonebraker. Why the ’data lake’ is really a ’data swamp’. BLOG@CACM, 2014.
[49] L. Sun, M. J. Franklin, J. Wang, and E. Wu. Skipping-oriented partitioning for columnar layouts. Proc. VLDB Endow., 10(4):421–432, Nov. 2016.
[50] A. Thusoo et al. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, pages 996–1005. IEEE, 2010.
[51] S. Venkataraman, Z. Yang, D. Liu, E. Liang, H. Falaki, X. Meng, R. Xin, A. Ghodsi, M. Franklin, I. Stoica, and M. Zaharia. SparkR: Scaling R programs with Spark. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, page 1099–1104, New York, NY, USA, 2016. Association for Computing Machinery.
[52] M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar. Accelerating the machine learning lifecycle with mlflow. IEEE Data Eng. Bull., 41:39–45, 2018.
[53] M. Ziauddin, A. Witkowski, Y. J. Kim, D. Potapov, J. Lahorani, and M. Krishna. Dimensions based data clustering and zone maps. Proc. VLDB Endow., 10(12):1622–1633, Aug. 2017.
