People were discussing the "three Vs" and approaches to handling these huge amounts of data.
Hadoop has emerged from a niche technology into one of the top tools for data processing, growing more popular as more big companies invest in it, whether by starting broad Hadoop implementations, by investing in one of the Hadoop vendors, or by becoming Hadoop vendors themselves. And as Hadoop became more and more popular, MPP databases entered their descent. Take Teradata's stock as an example: it has been falling steadily for the last three years, mainly because a new player, Hadoop, has entered its market.
So the question "should I choose an MPP solution or a Hadoop-based solution?" is becoming really popular among newcomers. Many vendors position Hadoop as a replacement for the traditional data warehouse, by which they mean a replacement for the MPP solutions. Others are more conservative in their messaging and push the Data Lake / Data Hub concept, where Hadoop and MPP live beside each other and integrate into a single solution.

So what is MPP? MPP stands for Massively Parallel Processing, the approach in grid computing in which all the separate nodes of the grid participate in coordinated computations. MPP DBMSs are database management systems built on top of this approach. In these systems, each query you start is split into a set of coordinated processes executed in parallel by the nodes of your MPP grid, splitting the computation so that it runs many times faster than in a traditional SMP RDBMS. One additional advantage this architecture delivers is scalability, because you can easily scale the grid by adding new nodes. To handle huge amounts of data, the data in these solutions is usually split between the nodes (sharded) so that each node processes only its local data. This further speeds up processing, because using shared storage for this kind of design would be a huge overkill: more complex, more expensive, less scalable, with higher network utilization and less parallelism. This is why most MPP DBMS solutions are shared-nothing and work on DAS storage or on sets of storage shelves shared between small groups of servers. This approach is used by solutions like Teradata, Greenplum, Vertica, Netezza, and similar ones. All of them have complex and mature SQL optimizers developed specifically for MPP. All of them are extensible in terms of built-in languages, and the toolsets around them support almost any customer wish, whether it is geospatial analytics, full-text search, or data mining. All of them are closed-source, complex enterprise solutions (although, FYI, Greenplum is to be open-sourced in Q4 2015) that have been around in this industry for years and are stable enough to run the mission-critical workloads of their users.
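To make the sharding idea concrete, here is a minimal Python sketch (not tied to any particular MPP product; the node count, keys, and rows are invented for illustration) of how a distribution key can be hashed to pick the node that stores each row:

```python
import hashlib

NUM_NODES = 8  # hypothetical size of the MPP grid

def node_for_key(distribution_key: str) -> int:
    """Hash the distribution key and map it onto one of the grid nodes."""
    digest = hashlib.md5(distribution_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

# Each row lands on exactly one node, so a full-table scan can run on
# all nodes in parallel, each reading only its local shard of the data.
rows = [("cust-1", 100.0), ("cust-2", 250.0), ("cust-3", 75.5)]
for key, amount in rows:
    print(f"row {key!r} -> node {node_for_key(key)}")
```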
What about Hadoop? It is not a single technology but an ecosystem of related projects, and it has its pros and cons. The biggest pro is extensibility: many new components arise (like Spark some time ago) and are kept integrated with the core technologies of base Hadoop, which saves you from lock-in and lets you keep growing the use cases of your cluster. As a con, I would note that building a platform out of separate technologies by yourself is a hell of a lot of work, and almost no one does it manually anymore; most companies run pre-built platforms like the ones provided by Cloudera and Hortonworks.
Hadoop's storage technology is built on a completely different approach. Instead of sharding the data based on some kind of key, it chunks the data into blocks of a fixed (configurable) size and splits them between the nodes. The blocks are big, and they are read-only, as is the filesystem as a whole (HDFS).
To put it simply, loading a small 100-row table into an MPP database would cause the engine to shard the data based on the table's key, so in a big enough cluster there is a high probability that each node stores only one row. In contrast, in HDFS the whole small table would be written into a single block, represented as a single file on a datanode's filesystem.
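The contrast is easy to show in code. Here is a minimal sketch of the HDFS-style chunking (with the block size shrunk to a few bytes so the example runs instantly; the real HDFS default is 128 MB, and the table contents are invented):

```python
BLOCK_SIZE = 64  # bytes, for illustration; HDFS defaults to 128 MB blocks

def chunk_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split a file into fixed-size blocks the way HDFS does,
    ignoring any record or key structure inside the data."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

small_table = b"a 100-row table easily fits in one block"
blocks = chunk_into_blocks(small_table)
print(len(blocks))  # 1 -> the whole table sits on a single datanode
```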
Next, what about cluster resource management? In contrast to the MPP design, the Hadoop resource manager (YARN) gives you more fine-grained resource management: unlike an MPP query, a MapReduce job does not require all of its computational tasks to run in parallel, so you can even process a huge amount of data within a set of tasks running on a single node if the rest of your cluster is fully utilized. YARN also has a series of nice features like extensibility, support for long-living containers, and so on. But in fact it is slower than an MPP resource manager and sometimes not that good at managing concurrency.
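A toy Python sketch of that scheduling difference (all numbers invented): MapReduce-style tasks can drain through however many containers happen to be free, in waves, whereas an MPP query would need all of its per-node processes at once:

```python
from collections import deque

def mapreduce_waves(num_tasks: int, free_containers: int) -> int:
    """Count the scheduling waves needed to run all tasks when only
    `free_containers` execution slots are available at a time."""
    pending = deque(range(num_tasks))
    waves = 0
    while pending:
        for _ in range(min(free_containers, len(pending))):
            pending.popleft()  # a batch of tasks runs to completion
        waves += 1
    return waves

# 100 tasks still finish on a busy cluster with only 4 free containers,
# just in more waves; an MPP query would simply wait for all its slots.
print(mapreduce_waves(num_tasks=100, free_containers=4))  # 25
```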
Next is the SQL interface for Hadoop. Here you have a wide choice of tools: it might be Hive running on MR/Tez/Spark, it might be SparkSQL, it might be Impala or HAWQ or IBM BigSQL, or it might be something completely different like Splice Machine. The choice is wide, and it is very easy to get lost among technologies like these.
The first option is Hive, an engine that translates SQL queries into MR/Tez/Spark jobs and executes them on the cluster. All the jobs are built on top of the same MapReduce concept, which gives you good cluster utilization options and good integration with the rest of the Hadoop stack. But the cons are big as well: high latency in executing queries, lower performance (especially for table joins), and no query optimizer (at least for now), so the engine executes exactly what you ask it to, even if it is the worst possible plan. (The accompanying diagram covers the obsolete MR1 design, but that is not important in our context.)
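To give a feel for the interface, here is a minimal sketch using the PyHive client (the host, port, and table names are placeholders, and the client library choice is my assumption; the original post does not prescribe one):

```python
from pyhive import hive  # pip install pyhive

# Connection details below are placeholders for illustration.
conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

# Hive compiles this join into MapReduce/Tez/Spark stages, so expect
# tens of seconds of latency even on small inputs.
cursor.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
""")
for row in cursor.fetchall():
    print(row)
```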
Solutions like Impala and HAWQ are on the other side of this spectrum: they are MPP execution engines on top of Hadoop, working with data stored in HDFS. They offer you much lower latency and much lower processing time for queries, at the cost of less scalability and less stability, just like the other MPP engines.
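The client-side experience is nearly the same SQL, but it is served by long-running daemons instead of freshly spawned MR tasks. A hedged sketch using the impyla client (host and port are placeholders; 21050 is Impala's usual client port, and the library choice is again my assumption):

```python
from impala.dbapi import connect  # pip install impyla

# Host and port below are placeholders for illustration.
conn = connect(host="impala-daemon", port=21050)
cursor = conn.cursor()

# The query is executed by long-running impalad daemons that stream
# data between each other, so the answer typically comes back in
# seconds rather than minutes.
cursor.execute("SELECT COUNT(*) FROM orders")
print(cursor.fetchall())
```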
SparkSQL is a different beast, sitting between the MapReduce and MPP-over-Hadoop approaches, trying to get the best of both worlds while having its own drawbacks. Like MR, it splits the job into a set of separately scheduled tasks, which gives better stability. Like MPP, it tries to stream data between execution stages to speed up processing. It also uses the fixed-executor concept familiar from the MPP-on-Hadoop engines (Impala's impalad daemons and HAWQ's segments) to reduce query latency. But it combines the drawbacks of these solutions as well: it is not as fast as MPP and not as stable and scalable as MapReduce.
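A minimal PySpark sketch of this hybrid (the executor count and file path are invented for illustration):

```python
from pyspark.sql import SparkSession

# A fixed pool of executors is requested up front, similar in spirit
# to impalad daemons or HAWQ segments (the setting is illustrative).
spark = (SparkSession.builder
         .appName("sparksql-sketch")
         .config("spark.executor.instances", "4")
         .getOrCreate())

# The query plan is split into stages of separately scheduled tasks
# (MapReduce-style), while data is pipelined between the stages where
# possible (MPP-style).
orders = spark.read.parquet("/data/orders")  # hypothetical dataset
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""").show()
```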
Now that I have covered all of these technologies separately, here is a table bringing them all together, comparing MPP and Hadoop:

| | MPP | Hadoop |
| --- | --- | --- |
| Platform openness | Proprietary, with rare exceptions | Completely open source |
| Hardware | Many solutions are appliances; you cannot deploy the software on your own cluster, and all solutions require specific enterprise-grade hardware | Any hardware works; vendors give some configuration guidelines, mostly recommending cheap commodity hardware with DAS |
| Scalability (nodes) | Tens of nodes on average, 100-200 at most | 100 nodes on average, thousands at most |
| Scalability (data) | Tens of TB on average, PBs at most | Hundreds of TB on average, tens of PB at most |
| Query latency | 10-20 milliseconds | 10-20 seconds |
| Average query runtime | 5-7 seconds | 10-15 minutes |
| Maximum query runtime | 1-2 hours | 1-2 weeks |
| Query optimization | Complex enterprise query optimizer engines | No optimizer, or an optimizer with fairly limited functionality |
| Query debugging and profiling | Query execution plans, query execution statistics, explanatory error messages | OOM issues and Java heap dump analysis, GC pauses on cluster components, separate logs for each task |
| Technology price | Tens to hundreds of thousands of dollars per node | Free, or up to thousands of dollars per node |
| Ease of use | Simple, friendly SQL interface and simple, easily interpreted in-database functions | SQL is not fully ANSI-compliant; users must mind the execution logic and the underlying data layout; functions usually have to be written in Java, compiled, and put on the cluster |
| Target users | Business analysts | Java developers and experienced DBAs |
| Target systems | General-purpose DWH and analytical systems | Purpose-built data processing engines |
| Minimal recommended size | Any | GBs |
| Maximum concurrency | Tens to hundreds of queries | Up to 10-20 jobs |
| Technology extensibility | Vendor-provided tools only | Integrates with any open-source tool (Spark, Samza, Tachyon, etc.) |
| Solution implementation complexity | Moderate | High |
Given all this information, you can see why Hadoop cannot be used as a complete replacement for the traditional enterprise data warehouse, but it can be used as an engine for processing huge amounts of data in a distributed way and getting important insights out of it. Facebook has a 300PB Hadoop installation and still uses a small 50TB Vertica cluster; LinkedIn has a huge Hadoop cluster and still uses an Aster Data cluster (an MPP bought by Teradata); and you could continue this list.