Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capability. Put simply, Hive is a MapReduce client: it translates the HQL that users write into MapReduce programs and runs them. So how exactly does Hive turn HQL into MapReduce?
As the figure shows, Hive works roughly as follows: it exposes a set of user interfaces through which it receives user commands (SQL); its Driver, together with the metadata in the Metastore, translates those commands into MapReduce jobs; the jobs are submitted to Hadoop for execution; and finally the results are returned to the user interface.
1) User interfaces: Client
CLI (the hive shell), JDBC/ODBC (Java access to Hive), and WebUI (browser access to Hive).
2) Metadata: Metastore
The metadata includes the table name, the database the table belongs to (default by default), the table owner, column/partition fields, the table type (e.g., whether it is an external table), the directory where the table data lives, and so on.
By default it is stored in the embedded Derby database, but MySQL is recommended for the Metastore instead.
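To see what the Metastore records for a given table, you do not have to query the backing database directly; Hive can surface it for you. A minimal sketch, using a hypothetical table name `employee`:

```sql
-- Shows metastore-backed details: owner, HDFS location, table type
-- (MANAGED_TABLE vs. EXTERNAL_TABLE), SerDe, partition keys, etc.
-- `employee` is a hypothetical table used only for illustration.
DESCRIBE FORMATTED employee;

-- Partition values registered in the metastore can be listed directly:
SHOW PARTITIONS employee;
```

This is often the quickest way to confirm which directory a table's data lives in and whether it is an external table.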
3) Hadoop
HDFS is used for storage, and MapReduce is used for computation.
4) Driver
(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST); this step is usually delegated to a third-party library such as ANTLR. The AST is then analyzed semantically: does the table exist, do the columns exist, is the SQL semantically valid?
(2) Compiler: compiles the AST into a logical execution plan.
(3) Optimizer (Query Optimizer): optimizes the logical execution plan.
(4) Executor (Execution): converts the logical execution plan into a physical plan that can actually run; for Hive, that means MR or Spark jobs.
Demo: you can use the explain command to see how Hive first translates a SQL statement into an execution plan:
> explain select * from mytest_staff_info_demo4_cp1
> where statis_date='20180229'
> order by name
> limit 3;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: mytest_staff_info_demo4_cp1
            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: name (type: string), deptcode (type: string), id (type: int), account (type: string), areacode (type: string), statis_date (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: string)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
          Limit
            Number of rows: 3
            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: true
              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 3

Time taken: 0.149 seconds, Fetched: 38 row(s)
As you can see, a Hive job is broken into one or more stages, and stages can depend on one another. The more complex the SQL, the more stages it produces, and typically the longer it takes to run. At our company (a top-3 e-commerce firm), for example, a task on the IDE platform should generally not run longer than 45 minutes; beyond that, it is best to split the task and optimize it.
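One common way to split a long-running job is to materialize an expensive intermediate result into its own table and then run the remaining logic against it, so each Hive job stays small. A minimal sketch, using hypothetical table and column names:

```sql
-- Step 1: materialize the expensive join/filter as its own job.
-- big_orders and order_detail are hypothetical tables for illustration.
CREATE TABLE tmp_orders_0229 AS
SELECT o.order_id, o.user_id, d.amount
FROM big_orders o
JOIN order_detail d ON o.order_id = d.order_id
WHERE o.statis_date = '20180229';

-- Step 2: run the aggregation as a separate, much smaller job.
SELECT user_id, SUM(amount) AS total_amount
FROM tmp_orders_0229
GROUP BY user_id;
```

Each statement now compiles into fewer stages, and the intermediate table can be inspected or reused if a downstream step fails.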
You can also use explain extended to see more detail, including the abstract syntax tree:
> explain extended select * from mytest_staff_info_demo4_cp1
> where statis_date='20180229'
> order by name
> limit 3;
OK
ABSTRACT SYNTAX TREE:
  TOK_QUERY
    TOK_FROM
      TOK_TABREF
        TOK_TABNAME
          mytest_staff_info_demo4_cp1
    TOK_INSERT
      TOK_DESTINATION
        TOK_DIR
          TOK_TMP_FILE
      TOK_SELECT
        TOK_SELEXPR
          TOK_ALLCOLREF
      TOK_WHERE
        =
          TOK_TABLE_OR_COL
            statis_date
          '20180229'
      TOK_ORDERBY
        TOK_TABSORTCOLNAMEASC
          TOK_TABLE_OR_COL
            name
      TOK_LIMIT
        3

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: mytest_staff_info_demo4_cp1
            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
            GatherStats: false
            Select Operator
              expressions: name (type: string), deptcode (type: string), id (type: int), account (type: string), areacode (type: string), statis_date (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
                tag: -1
                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: string)
      Path -> Alias:
        hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229 [mytest_staff_info_demo4_cp1]
      Path -> Partition:
        hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229
          Partition
            base file name: statis_date=20180229
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            partition values:
              statis_date 20180229
            properties:
              bucket_count -1
              columns name,deptcode,id,account,areacode
              columns.comments
              columns.types string:string:int:string:string
              field.delim
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229
              name fdm_sor.mytest_staff_info_demo4_cp1
              partition_columns statis_date
              partition_columns.types string
              serialization.ddl struct mytest_staff_info_demo4_cp1 { string name, string deptcode, i32 id, string account, string areacode}
              serialization.format
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              transient_lastDdlTime 1521166440
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                columns name,deptcode,id,account,areacode
                columns.comments
                columns.types string:string:int:string:string
                field.delim
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://SuningHadoop2/user/finance/hive/warehouse/fdm_sor.db/mytest_staff_info_demo4_cp1
                name fdm_sor.mytest_staff_info_demo4_cp1
                partition_columns statis_date
                partition_columns.types string
                serialization.ddl struct mytest_staff_info_demo4_cp1 { string name, string deptcode, i32 id, string account, string areacode}
                serialization.format
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1519780670
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: fdm_sor.mytest_staff_info_demo4_cp1
            name: fdm_sor.mytest_staff_info_demo4_cp1
      Truncated Path -> Alias:
        /fdm_sor.db/mytest_staff_info_demo4_cp1/statis_date=20180229 [mytest_staff_info_demo4_cp1]
      Needs Tagging: false
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
          Limit
            Number of rows: 3
            Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: true
              GlobalTableId: 0
              directory: alluxio-ft://namenode1-sit.cnsuning.com:19998/user/finance/tmp/hive-finance/hive_2018-03-19_09-35-04_376_4971094765120649963-1/-ext-10001
              NumFilesPerFileSink: 1
              Statistics: Num rows: 0 Data size: 195 Basic stats: PARTIAL Column stats: NONE
              Stats Publishing Key Prefix: alluxio-ft://namenode1-sit.cnsuning.com:19998/user/finance/tmp/hive-finance/hive_2018-03-19_09-35-04_376_4971094765120649963-1/-ext-10001/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    columns _col0,_col1,_col2,_col3,_col4,_col5
                    columns.types string:string:int:string:string:string
                    escape.delim \
                    hive.serialization.extend.nesting.levels true
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              TotalFiles: 1
              GatherStats: false
              MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: 3

Time taken: 0.151 seconds, Fetched: 133 row(s)