A complete Hadoop 2.0 cluster follows a modular design. Its core projects are:
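- Hadoop Common: The common utilities that support the other Hadoop modules, for example permission management.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. It stores the cluster's data and offers functionality similar to that of an ordinary file system.
- Hadoop YARN: A framework for job scheduling and cluster resource management. It answers questions such as how to detect that a new machine has joined the cluster and how to assign work to it, and how to detect that a machine has failed and left the cluster and how to move its computation to other nodes (re-replicating the data that machine held is handled by HDFS).
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.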
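Downloading and installing Hadoop from the official Hadoop website installs all four core projects (Common, HDFS, YARN, and MapReduce), which work together to run a Hadoop cluster. HDFS and YARN are also used elsewhere as independent modules. Spark, another distributed computing system, is likewise modular: it can use HDFS as its underlying storage layer (or an ordinary local file system), and it can use YARN for resource management and scheduling (or an alternative such as Mesos).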
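The resource-management duties attributed to YARN above (noticing new machines, handing them work, noticing failed machines, and moving their work elsewhere) can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `ToyResourceManager`, its methods, the node names, and the 10-second timeout are all hypothetical and are not YARN's actual API or defaults. It only shows the general heartbeat-and-reschedule idea.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Toy heartbeat-based membership tracker; NOT YARN's real API."""
    timeout: float = 10.0                             # assumed silence threshold, in seconds
    last_seen: dict = field(default_factory=dict)     # node -> time of last heartbeat
    assignments: dict = field(default_factory=dict)   # node -> list of task names

    def heartbeat(self, node: str, now: float) -> None:
        # A first heartbeat from an unknown node registers it with the cluster.
        self.last_seen[node] = now
        self.assignments.setdefault(node, [])

    def submit(self, task: str) -> str:
        # Place the task on the live node that currently holds the fewest tasks.
        if not self.assignments:
            raise RuntimeError("no live nodes")
        node = min(self.assignments, key=lambda n: len(self.assignments[n]))
        self.assignments[node].append(task)
        return node

    def expire(self, now: float) -> list:
        # Drop nodes whose heartbeat is overdue and reschedule their tasks.
        dead = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        orphaned = []
        for n in dead:
            del self.last_seen[n]
            orphaned.extend(self.assignments.pop(n))
        for task in orphaned:
            self.submit(task)
        return dead


rm = ToyResourceManager(timeout=10.0)
rm.heartbeat("node-a", now=0.0)
rm.heartbeat("node-b", now=0.0)
placed = [rm.submit(f"task-{i}") for i in range(4)]

rm.heartbeat("node-a", now=12.0)   # node-a reports in again; node-b stays silent
lost = rm.expire(now=15.0)         # node-b is dropped and its tasks move to node-a
print(lost, rm.assignments)
```

Timestamps are passed in explicitly rather than read from a clock so that the behaviour is deterministic. A real scheduler also has to account for per-node capacity, data locality, and the case where no live node remains, all of which this sketch leaves out.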
Other Hadoop Ecosystem Projects
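Many other projects are not required but add functionality and widen the range of use cases. Each of them has to be installed separately.
For example, Ambari is a web-based tool for managing Hadoop clusters. Hive is a data warehouse built on top of Hadoop that offers SQL-like queries; its query language (called HQL) can be used to query and analyze data stored in HDFS quickly.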
- Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
- ZooKeeper™: A high-performance coordination service for distributed applications.
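The Tez entry above describes an engine that executes an arbitrary DAG of tasks. As a rough illustration of that idea only, the following sketch orders a small task graph into "waves" whose members have no unmet dependencies. It uses Python's standard `graphlib` module, not Tez itself (Tez is a Java framework), and the task names and the `run_dag` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each key is a task, each value is the set of tasks it depends on.
dag = {
    "load_orders": set(),
    "load_users": set(),
    "join": {"load_orders", "load_users"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}


def run_dag(graph):
    """Group tasks into waves; every task in a wave has all of its inputs ready."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock successors
    return waves


waves = run_dag(dag)
print(waves)
```

Here the two load tasks form the first wave because neither depends on anything, and each later task waits for its predecessor. A real execution engine would actually run the tasks in a wave concurrently and move data between them; this sketch only computes the order.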
Hadoop Distributions
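Linux is an open-source operating system, which is why there are many modified and enhanced Linux distributions such as Ubuntu and CentOS. Hadoop is open source in the same way: besides the official release from the Hadoop website, commercial companies offer their own enhanced Hadoop distributions, the best known being those from Cloudera and Hortonworks. Both companies provide complete Hadoop solutions and offer consulting, support, and training to their partners.
Cloudera's distribution is called CDH. It bundles Hadoop, Pig, HBase, Hive, and almost every other Hadoop-related project, and it is open source as well. Hortonworks' distribution is called HDP; it also bundles many projects, including Hive, Pig, Ambari, and HCatalog, and is open source.
Overall, compared with the official release, both Cloudera and Hortonworks extend Hadoop's functionality and reduce the time and staff expertise needed to set up and manage a Hadoop cluster.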