大数据olap引擎对比



(Introduction)

This article is an extract from my previous article Big Data Pipeline Recipe. In this post I will focus only on Big Data query engines for data analytics.

本文摘自我之前的文章《 大数据管道食谱》 。 在本文中,我将仅关注用于数据分析的大数据查询引擎

If you are here, is because you have ingested raw data, processed it and it is now ready to be consumed by downstream systems.

如果在这里,是因为您已经摄取了原始数据,对其进行了处理,并且现在可供下游系统使用。

There are a wide range of tools used to query the data, each one has its advantages and disadvantages. Most of them focused on OLAP but few are also optimized for OLTP. Some use standard formats and focus only on running the queries whereas others use their own format/storage to push processing to the source to improve performance. Some are optimized for data warehousing using star or snowflake schema whereas others are more flexible. To summarize, these are the different considerations:

有多种工具可用于查询数据,每种工具都有其优点和缺点。 它们中的大多数都集中在OLAP上,但是也很少针对OLTP进行过优化。 有些使用标准格式,仅专注于运行查询,而另一些使用自己的格式/存储将处理推送到源,以提高性能。 有些使用星型或雪花模式针对数据仓库进行了优化,而另一些则更为灵活。 总结起来, 这些是不同的考虑因素

  • Data warehouse vs data lake
  • Hadoop vs Standalone
  • OLAP vs OLTP
  • Query Engine vs. OLAP Engines

We should also consider processing engines with querying capabilities.

我们还应该考虑具有查询功能的处理引擎。

(Processing Engines)

Most of the engines we described in the previous section can connect to the metadata server such as Hive and run queries, create views, etc. This is a common use case to create refined reporting layers.

我们在上一节中描述的大多数引擎都可以连接到元数据服务器,例如Hive并运行查询,创建视图等。这是创建精细报表层的常见用例。

Spark SQL provides a way to seamlessly mix SQL queries with Spark programs, so you can mix the DataFrame API with SQL. It has Hive integration and standard connectivity through JDBC or ODBC; so you can connect Tableau, Looker or any BI tool to your data through Spark.

Spark SQL提供了一种将SQL查询与Spark程序无缝混合的方法,因此您可以将DataFrame API与SQL混合。 它具有Hive集成和通过JDBC或ODBC的标准连接; 因此您可以通过Spark将TableauLooker或任何BI工具连接到数据。

Apache Flink also provides SQL API. Flink’s SQL support is based on Apache Calcite which implements the SQL standard. It also integrates with Hive through the HiveCatalog. For example, users can store their Kafka or ElasticSearch tables in Hive Metastore by using HiveCatalog, and reuse them later on in SQL queries.

Apache Flink还提供SQL API。 FlinkSQL支持基于实现SQL标准的Apache Calcite 。 它还通过HiveCatalogHive集成。 例如,用户可以使用HiveCatalog将其Kafka或ElasticSearch表存储在Hive Metastore中,并稍后在SQL查询中重新使用它们。

Kafka also provides SQL capabilities and in general, most processing engines provide SQL interfaces.

Kafka还提供SQL功能,通常,大多数处理引擎都提供SQL接口。

(Query Engines)

This type of tools focus on querying different data sources and formats in an unified way. The idea is to query your data lake using SQL queries like if it was a relational database, although it has some limitations. Some of these tools can also query NoSQL databases and much more. These tools provide a JDBC interface for external tools, such as Tableau or Looker, to connect in a secure fashion to your data lake. Query engines are the slowest option but provide the maximum flexibility.

这类工具专注于以统一的方式查询不同的数据源和格式 。 这个想法是使用SQL查询来查询您的数据湖,就像它是一个关系数据库一样,尽管它有一些限制。 其中一些工具还可以查询NoSQL数据库等等。 这些工具为外部工具(如TableauLooker)提供JDBC接口,以安全方式连接到您的数据湖。 查询引擎是最慢的选择,但提供最大的灵活性。

  • Apache Pig: It was one of the first query languages along with Hive. It has its own language different from SQL. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. It is not in decline in favor of newer SQL based engines.
    Apache Pig :它是Hive中最早的查询语言之一。 它有自己的语言,不同于SQL。 Pig程序的显着特性是它们的结构适合于实质性的并行化 ,从而使它们能够处理非常大的数据集。 它并不支持基于SQL的新引擎。
  • Presto: Released as open source by Facebook, it’s an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases and file systems. It can perform queries on large data sets in a manner of seconds. It is independent of Hadoop but integrates with most of its tools, especially Hive to run SQL queries.
    Presto :由Facebook作为开源发布,它是一个开源的分布式SQL查询引擎,用于对各种规模的数据源运行交互式分析查询。 Presto允许查询数据所在的位置,包括Hive,Cassandra,关系数据库和文件系统。 它可以在几秒钟内对大型数据集执行查询。 它独立于Hadoop,但与大多数工具集成在一起,尤其是Hive以运行SQL查询。
  • Apache Drill: Provides a schema-free SQL Query Engine for Hadoop, NoSQL and even cloud storage. It is independent of Hadoop but has many integrations with the ecosystem tools such Hive. A single query can join data from multiple datastores performing optimizations specific to each data store. It is very good at allowing analysts to treat any data like a table, even if they are reading a file under the hood. Drill supports fully standard SQL. Business users, analysts and data scientists can use standard BI/analytics tools such as Tableau, Qlik and Excel to interact with non-relational data stores by leveraging Drill’s JDBC and ODBC drivers. Furthermore, developers can leverage Drill’s simple REST API in their custom applications to create beautiful visualizations.
    Apache Drill :提供用于Hadoop,NoSQL甚至云存储的无模式SQL查询引擎。 它独立于Hadoop,但与Hive等生态系统工具集成了许多功能。 单个查询可以联接来自多个数据存储的数据,从而执行特定于每个数据存储的优化。 即使分析人员在后台读取文件,它也非常擅长让分析人员将任何数据视为表。 Drill支持完全标准SQL 。 商业用户,分析师和数据科学家可以使用标准的BI /分析工具(如TableauQlik和Excel)来利用Drill的JDBC和ODBC驱动程序与非关系数据存储进行交互。 此外,开发人员可以在其自定义应用程序中利用Drill的简单REST API创建漂亮的可视化文件。



大数据kappa和lambda架构_java


Drill model 钻头型号

(OLTP Databases)

Although, Hadoop is optimized for OLAP there are still some options if you want to perform OLTP queries for an interactive application.

尽管Hadoop针对OLAP进行了优化,但是如果您要为交互式应用程序执行OLTP查询,仍然有一些选择。

HBase is has very limited ACID properties by design, since it was built to scale and does not provides ACID capabilities out of the box but it can be used for some OLTP scenarios.

HBase的 由于它是按比例缩放的,并且不是开箱即用的提供ACID功能,因此它在设计上具有非常有限的ACID属性,但是可以用于某些OLTP场景。

Apache Phoenix is built on top of HBase and provides a way to perform OTLP queries in the Hadoop ecosystem. Apache Phoenix is fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and Map Reduce. It also can store metadata and it supports table creation and versioned incremental alterations through DDL commands. It is quite fast, faster than using Drill or other query engine.

Apache Phoenix建立在HBase之上,并提供了一种在Hadoop生态系统中执行OTLP查询的方法。 Apache Phoenix与其他Hadoop产品(例如Spark,Hive,Pig,Flume和Map Reduce)完全集成。 它还可以存储元数据,并支持通过DDL命令创建表和版本化的增量更改。 这是相当快的 ,比使用钻或其他查询引擎更快。

You may use any massive scale database outside the Hadoop ecosystem such as Cassandra, YugaByteDB, ScyllaDB for OTLP.

您可以使用Hadoop生态系统之外的任何大规模数据库,例如Cassandra,YugaByteDB,ScyllaDB for OTLP

Finally, it is very common to have a subset of the data, usually the most recent, in a fast database of any type such MongoDB or MySQL. The query engines mentioned above can join data between slow and fast data storage in a single query.

最后,在任何类型的快速数据库(例如MongoDB或MySQL)中拥有数据的子集(通常是最新数据)是很常见的。 上面提到的查询引擎 可以在单个查询中在慢速和快速数据存储之间加入数据。

(Distributed Search Indexes)

These tools provide a way to store and search unstructured text data and they live outside the Hadoop ecosystem since they need special structures to store the data. The idea is to use an inverted index to perform fast lookups. Besides text search, this technology can be used for a wide range of use cases like storing logs, events, etc. There are two main options:

这些工具提供了一种存储和搜索非结构化文本数据的方法, 并且由于它们需要特殊的结构来存储数据 ,因此它们位于Hadoop生态系统之外。 这个想法是使用倒排索引来执行快速查找。 除了文本搜索外,该技术还可以用于各种用例,例如存储日志,事件等。有两个主要选项:

  • Solr: it is a popular, blazing-fast, open source enterprise search platform built on Apache Lucene. Solr is reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. It is great for text search but its use cases are limited compared to ElasticSearch.
    Solr :这是一个基于Apache Lucene的流行,快速,开放源代码的企业搜索平台。 Solr是可靠的,可伸缩的和容错的,提供分布式索引,复制和负载平衡查询,自动故障转移和恢复,集中式配置等。 它非常适合于文本搜索,但是与ElasticSearch相比,它的用例有限。
  • ElasticSearch: It is also a very popular distributed index but it has grown into its own ecosystem which covers many use cases like APM, search, text storage, analytics, dashboards, machine learning and more. It is definitely a tool to have in your toolbox either for DevOps or for your data pipeline since it is very versatile. It can also store and search videos and images.
    ElasticSearch :它也是一个非常受欢迎的分布式索引,但是已经发展成为自己的生态系统,涵盖了许多用例,例如APM ,搜索,文本存储,分析,仪表板,机器学习等。 它绝对是一种非常有用的工具,可用于DevOps或数据管道,因为它非常通用。 它还可以存储和搜索视频和图像。

ElasticSearch can be used as a fast storage layer for your data lake for advanced search functionality. If you store your data in a key-value massive database, like HBase or Cassandra, which provide very limited search capabilities due to the lack of joins; you can put ElasticSearch in front to perform queries, return the IDs and then do a quick lookup on your database.

ElasticSearch可用作高级存储功能快速存储层或您的数据湖。 如果将数据存储在键值型海量数据库中,例如HBase或Cassandra,由于缺少联接,它们提供的搜索功能非常有限; 您可以将ElasticSearch放在前面以执行查询,返回ID,然后对数据库进行快速查找。

It can be used also for analytics; you can export your data, index it and then query it using Kibana, creating dashboards, reports and much more, you can add histograms, complex aggregations and even run machine learning algorithms on top of your data. The Elastic Ecosystem is huge and worth exploring.

它也可以用于分析 ; 您可以导出数据,对其进行索引,然后使用Kibana对其进行查询,创建仪表板,报告等,还可以添加直方图,复杂的聚合甚至在数据之上运行机器学习算法。 弹性生态系统非常庞大,值得探索。


大数据kappa和lambda架构_数据库_02

(OLAP Databases)

In this category we have databases which may also provide a metadata store for schemas and query capabilities. Compared to query engines, these tools also provide storage and may enforce certain schemas in case of data warehouses (star schema). These tools use SQL syntax and Spark and other frameworks can interact with them.

在此类别中,我们有数据库,该数据库还可以提供用于模式和查询功能的元数据存储。 与查询引擎相比,这些工具还提供存储,并且在数据仓库的情况下可以强制执行某些架构( 星型架构 )。 这些工具使用SQL语法,Spark和其他框架可以与它们进行交互。

  • Apache Hive: We already discussed Hive as a central schema repository for Spark and other tools so they can use SQL, but Hive can also store data, so you can use it as a data warehouse. It can access HDFS or HBase. When querying Hive it leverages on Apache Tez, Apache Spark, or MapReduce, being Tez or Spark much faster. It also has a procedural language called HPL-SQL. Hive it is extremely popular meta store for Spark SQL.
    Apache Hive :我们已经讨论过Hive作为Spark和其他工具的中央模式存储库,以便他们可以使用SQL ,但是Hive也可以存储数据,因此您可以将其用作数据仓库。 它可以访问HDFSHBase 。 查询Hive时,它会利用Apache TezApache SparkMapReduce ,从而使Tez或Spark快得多。 它还具有一种称为HPL-SQL的过程语言。 蜂巢它是Spark SQL极为流行的元存储。
  • Apache Impala: It is a native analytic database for Hadoop, that you can use to store data and query it in an efficient manner. It can connect to Hive for metadata using Hcatalog. Impala provides low latency and high concurrency for BI/analytic queries on Hadoop (not delivered by batch frameworks such as Apache Hive). Impala also scales linearly, even in multitenant environments making a better alternative for queries than Hive. Impala is integrated with native Hadoop security and Kerberos for authentication, so you can securely managed data access. It uses HBase and HDFS for data storage.
    Apache Impala :这是Hadoop的本地分析数据库 ,您可以使用它来存储数据并以有效的方式对其进行查询。 它可以使用Hcatalog连接到Hive获取元数据。 Impala为Hadoop上的BI /分析查询提供了低延迟和高并发性(不是由批处理框架(如Apache Hive)提供的)。 即使在多租户环境中,Impala也会线性扩展,从而比Hive更好地替代查询。 Impala与本机Hadoop安全性和Kerberos集成在一起以进行身份验证,因此您可以安全地管理数据访问。 它使用HBaseHDFS进行数据存储。


  • Apache Tajo: It is another data warehouse for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL on large-data sets stored on HDFS and other data sources. It has integration with Hive Metastore to access the common schemas. It has many query optimizations, it is scalable, fault tolerant and provides a JDBC interface.
    Apache Tajo :这是Hadoop的另一个数据仓库 。 Tajo专为针对存储在HDFS和其他数据源上的大数据集的低延迟和可扩展的即席查询,在线聚合和ETL而设计。 它与Hive Metastore集成在一起以访问通用模式。 它具有许多查询优化功能,具有可伸缩性,容错能力,并提供JDBC接口。
  • Apache Kylin: Apache Kylin is a newer distributed Analytical Data Warehouse. Kylin is extremely fast, so it can be used to complement some of the other databases like Hive for use cases where performance is important such as dashboards or interactive reports, it is probably the best OLAP data warehouse but it is more difficult to use, another problem is that because of the high dimensionality, you need more storage. The idea is that if query engines or Hive are not fast enough, you can create a “Cube” in Kylin which is a multidimensional table optimized for OLAP with pre computed values which you can query from your dashboards or interactive reports. It can build cubes directly from Spark and even in near real time from Kafka.
    Apache Kylin :Apache Kylin是更新的分布式分析数据仓库 。 Kylin的运行速度非常快 ,因此对于仪表盘或交互式报表等性能非常重要的用例,它可以用于补充Hive等其他一些数据库,它可能是最好的OLAP数据仓库,但使用起来比较困难。问题在于,由于维数高,您需要更多的存储空间。 这个想法是,如果查询引擎或Hive不够快,则可以在Kylin中创建“ 多维数据集 ”,这是针对OLAP优化的多维表,具有可从仪表板或交互式报表中查询的预先计算的值。 它可以直接从Spark生成多维数据集,甚至可以从Kafka实时生成多维数据集。


(OLAP Engines)

In this category, I include newer engines that are an evolution of the previous OLAP databases which provide more functionality creating an all-in-one analytics platform. Actually, they are a hybrid of the previous two categories adding indexing to your OLAP databases. They live outside the Hadoop platform but are tightly integrated. In this case, you would typically skip the processing phase and ingest directly using these tools.

在此类别中,我包括较新的引擎 ,这些引擎是对以前的OLAP数据库的改进,该数据库提供了创建多合一分析平台的更多功能。 实际上,它们是前两种类别的混合 ,为您的OLAP数据库添加了索引 。 它们位于Hadoop平台之外,但紧密集成。 在这种情况下,您通常会跳过处理阶段并直接使用这些工具进行提取。

They try to solve the problem of querying real time and historical data in an uniform way, so you can immediately query real-time data as soon as it’s available alongside historical data with low latency so you can build interactive applications and dashboards. These tools allow in many cases to query the raw data with almost no transformation in an ELT fashion but with great performance, better than regular OLAP databases.

他们试图解决以统一的方式查询实时和历史数据的问题,因此您可以立即将实时数据与具有低延迟的历史数据一起立即查询,从而可以构建交互式应用程序和仪表板。 这些工具在许多情况下允许以ELT方式进行几乎没有任何转换的原始数据查询,但性能却优于常规OLAP数据库。

What they have in common is that they provided a unified view of the data, real time and batch data ingestion, distributed indexing, its own data format, SQL support, JDBC interface, hot-cold data support, multiple integrations and a metadata store.

它们的共同点是它们提供了数据统一视图,实时和批处理数据的接收,分布式索引,其自身的数据格式,SQL支持,JDBC接口,热冷数据支持,多种集成元数据存储。

  • Apache Druid: It is the most famous real time OLAP engine. It focused on time series data but it can be used for any kind of data. It uses its own columnar format which can heavily compress the data and it has a lot of built in optimizations like inverted indices, text encoding, automatic data roll up and much more. Data is ingested in real time using Tranquility or Kafka which has very low latency, data is kept in memory in a row format optimized for writes but as soon as it arrives is available to be query just like previous ingested data. A background task in in charge of moving the data asynchronously to a deep storage system such HDFS. When data is moved to deep storage it is converted into smaller chunks partitioned by time called segments which are highly optimized for low latency queries. It segment has a timestamp, several dimensions which you can use to filter and perform aggregations; and metrics which are pre computed aggregations. For batch ingestion, it saves data directly into Segments. It support push and pull ingestion. It has integrations with Hive, Spark and even NiFi. It can use Hive metastore and it supports Hive SQL queries which then are converted to JSON queries used by Druid. The Hive integration supports JDBC so you can connect any BI tool. It also has its own metadata store, usually MySQL. It can ingest vast amounts of data and scale very well. The main issue is that it has a lot of components and it is difficult to manage and deploy.
    Apache Druid :这是最著名的实时OLAP引擎。 它专注于时间序列数据,但可用于任何类型的数据。 它使用自己的列式格式可以严重压缩数据,并且具有很多内置的优化功能,例如倒排索引 ,文本编码,自动数据汇总等等。 使用具有极低延迟的Tranquility或Kafka实时摄取数据,数据以针对写入进行优化的行格式保存在内存中,但是一旦到达,就可以像以前摄取的数据一样查询。 后台任务,负责将数据异步移动到深度存储系统(例如HDFS)。 当数据移至深层存储时,它将转换为较小的块,这些块按时间进行了划分 ,这些针对低延迟查询进行了高度优化。 它的片段具有时间戳,可以用于过滤和执行聚合的几个维度; 和指标是预先计算的汇总。 对于批量提取,它将数据直接保存到细分中。 它支持推拉式摄取。 它与Hive,Spark甚至NiFi 集成在一起 它可以使用Hive Metastore,并且支持Hive SQL查询,然后将其转换为Druid使用的JSON查询。 Hive集成支持JDBC,因此您可以连接任何BI工具。 它还有自己的元数据存储,通常是MySQL。 它可以吸收大量数据并很好地扩展。 主要问题是它具有许多组件,并且难以管理和部署。


大数据kappa和lambda架构_mysql_03

Druid architecture 德鲁伊架构

  • Apache Pinot: It is a newer alternative to Druid open sourced by LinkedIn. Compared to Druid, it offers lower latency thanks to the Startree index which offer partial pre computation, so it can be used for user facing apps(it used to get the LinkedIn feeds). It uses a sorted index instead of inverted index which is faster. It has an extendable plugin architecture and also has many integrations but does not support Hive. It also unifies batch and real time, provides fast ingestion, smart index and stores the data in segments. It is easier to deploy and faster compared to Druid but it is a bit immature at the moment.
    Apache Pinot :它是LinkedIn开源的Druid的更新替代品。 与Druid相比,由于Startree索引提供了部分预计算,因此它提供了更低的延迟 ,因此它可以用于面向用户的应用程序(用于获取LinkedIn提要)。 它使用排序索引而不是倒排索引,索引速度更快。 它具有可扩展的插件架构,并且具有许多集成,但不支持Hive。 它还统一了批处理和实时功能,提供快速接收,智能索引并将数据分段存储。 与Druid相比,它更容易部署且速度更快,但目前还不成熟


大数据kappa和lambda架构_java_04

Apache Pinot 阿皮诺

  • ClickHouse: Written in C++, this engine provides incredible performance for OLAP queries, especially aggregations. It looks like a relational database so you can model the data very easily. It is very easy to set up and has many integrations.
    ClickHouse :用C ++编写,此引擎为OLAP查询(尤其是聚合)提供了令人难以置信的性能。 它看起来像一个关系数据库,因此您可以非常轻松地对数据建模。 它非常容易设置,并且具有许多集成。


ClickHouse ClickHouse

Check this article which compares the 3 engines in detail. Again, start small and know your data before making a decision, these new engines are very powerful but difficult to use. If you can wait a few hours, then use batch processing and a data base such Hive or Tajo; then use Kylin to accelerate your OLAP queries to make them more interactive. If that’s not enough and you need even lower latency and real time data, consider OLAP engines. Druid is more suitable for real-time analysis. Kylin is more focused on OLAP cases. Druid has good integration with Kafka as real-time streaming; Kylin fetches data from Hive or Kafka in batches; although real time ingestion is planned.

查看这篇文章 ,其中详细比较了3个引擎。 同样,从小处着手并在做出决定之前了解您的数据,这些新引擎非常强大,但难以使用 。 如果您可以等待几个小时,则使用批处理和Hive或Tajo等数据库; 然后使用Kylin加快OLAP查询的速度,使其更具交互性。 如果这还不够,并且您需要更低的延迟和实时数据,请考虑使用OLAP引擎。 德鲁伊更适合实时分析。 麒麟更专注于OLAP案件。 Druid与Kafka的实时流媒体集成良好。 Kylin分批从Hive或Kafka获取数据; 尽管计划实时摄取。

Finally, Greenplum is another OLAP engine with more focus on AI.

最后, Greenplum是另一个更专注于AI的 OLAP引擎。


Presto/Drill provide more flexibility, Kylin great latency, Druid and Pinot, the best of both worlds. Presto / Drill提供更大的灵活性,Kylin的延迟长,Druid和Pinot兼具两者的优点。

(Data Visualization)

Finally, for visualization you have several commercial tools such Qlik, Looker or Tableau.

最后,对于可视化,您有几个商业工具,例如QlikLookerTableau

For Open Source, check SuperSet, an amazing tool that support all the tools we mentioned, has a great editor and it is really fast, it uses SQLAlchemy under the hood to support many databases.

对于开放源代码,请检查SuperSet ,它是支持我们提到的所有工具的出色工具,具有出色的编辑器,而且速度非常快,它在后台使用SQLAlchemy来支持许多数据库。

Metabase or Falcon are other great options.

MetabaseFalcon是其他不错的选择。

(Conclusion)

We have seem a wide range of tools that can be used to query your data. From flexible query engines like Presto to high performance data ware house like Kylin. There is no single solution, my advise is to get to know your data and start small. Query engines are a good starting point because of its flexibility. Then, for different use cases you may need to add more tools to meet your SLAs.

我们似乎有各种各样的工具可用于查询您的数据。 从诸如Presto之类的灵活查询引擎到诸如Kylin之类的高性能数据仓库。 没有单一的解决方案 ,我的建议是了解您的数据并从小处着手。 由于其灵活性,查询引擎是一个很好的起点。 然后,对于不同的用例,您可能需要添加更多工具来满足您的SLA。

Pay special attention to newer tools like Druid or Pinot which provide an easy way to analyze huge amounts of data with very low latency, closing the gap between OLTP and OLAP in terms of performance. You may be tempted to think about processing, pre compute aggregations, etc. but consider these tools if you want to simplify your pipeline.

要特别注意像DruidPinot这样的新工具,它们提供了一种以极低的延迟来分析大量数据的简便方法,从而在性能方面缩小了OLTP和OLAP之间的差距。 您可能很想考虑处理,预计算聚合等问题,但是如果您想简化管道,请考虑使用这些工具。

I hope you enjoyed this article. Feel free to leave a comment or share this post. Follow me for future post.

希望您喜欢这篇文章。 随时发表评论或分享这篇文章。 我来 为将来的职位。


翻译自: https://itnext.io/olap-query-engines-for-big-data-5f17b88d6ebc

大数据olap引擎对比