Hive的底层就是MapReduce的编程实现,我们可以通过执行计划详细的了解执行过程。对于我们对底层的理解,有很大的帮助。
语法及结构
官方对Hive Explain的英文解释,如果大家英文不错的话,强推:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
首先,Explain的语法:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
从语法组成可以看出来是一个“explain ”+三个可选参数+查询语句。大家可以积极尝试一下,后面两个显示内容很简单的,我介绍一下第一个 extended 这个可以显示hql语句的语法树
其次,执行计划一共有三个部分:
- 这个语句的抽象语法树
- 这个计划不同阶段之间的依赖关系
- 对于每个阶段的详细描述
简单例子
我首先举一个简单的例子:select * from emp;(十分常见的一张表)
- 展示数据
hive> select * from emp;
OK
369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
Time taken: 1.305 seconds, Fetched: 14 row(s)
- 展示执行计划
hive> explain extended select * from emp;
OK
ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_TABREF
TOK_TABNAME
emp
TOK_INSERT
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT
TOK_SELEXPR
TOK_ALLCOLREF
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: emp
Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Select Operator
expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE
ListSink
Time taken: 1.844 seconds, Fetched: 34 row(s)
- 第一阶段,展示抽象语法树
首先指定表,从例子可以看出指定emp表,然后是否把查询结构插入到另一个表,由这个例子仅仅是查询,所以insert这部分为空。最后是查询的字段,由于我们写的是“*”所以展示为 TOK_ALLCOLREF全部字段。 - 第二阶段
展示各个阶段的依赖关系,由于我们这个查询语句过于简单,所以并没有启动MapReduce,只有一个阶段,没有显示出依赖关系 - 第三阶段
对Stage-0这个阶段进行详细解读
TableScan
:查看表
alias: emp
:所需要的表
Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE
:这张表的基本信息
expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
:表中需要输出的字段及类型
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
:输出的的字段编号
复杂的例子
- 展示数据
hive> select * from dept;
OK
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS
我们来展示一个复杂的join操作
select e.empno,e.ename,e.deptno, d.dname
from emp e join dept d on e.deptno=d.deptno;
- 展示执行计划
hive> explain extended select e.empno,e.ename,e.deptno, d.dname
> from emp e join dept d on e.deptno=d.deptno;
OK
ABSTRACT SYNTAX TREE:
TOK_QUERY
TOK_FROM
TOK_JOIN
TOK_TABREF
TOK_TABNAME
emp
e
TOK_TABREF
TOK_TABNAME
dept
d
=
.
TOK_TABLE_OR_COL
e
deptno
.
TOK_TABLE_OR_COL
d
deptno
TOK_INSERT
TOK_DESTINATION
TOK_DIR
TOK_TMP_FILE
TOK_SELECT
TOK_SELEXPR
.
TOK_TABLE_OR_COL
e
empno
TOK_SELEXPR
.
TOK_TABLE_OR_COL
e
ename
TOK_SELEXPR
.
TOK_TABLE_OR_COL
e
deptno
TOK_SELEXPR
.
TOK_TABLE_OR_COL
d
dname
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: deptno (type: int)
sort order: +
Map-reduce partition columns: deptno (type: int)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
tag: 0
value expressions: empno (type: int), ename (type: string)
auto parallelism: false
TableScan
alias: d
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: deptno (type: int)
sort order: +
Map-reduce partition columns: deptno (type: int)
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
tag: 1
value expressions: dname (type: string)
auto parallelism: false
Path -> Alias:
hdfs://hadoop:8020/user/hive/warehouse/dept [d]
hdfs://hadoop:8020/user/hive/warehouse/emp [e]
Path -> Partition:
hdfs://hadoop:8020/user/hive/warehouse/dept
Partition
base file name: dept
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns deptno,dname,loc
columns.comments
columns.types int:string:string
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://hadoop:8020/user/hive/warehouse/dept
name default.dept
numFiles 1
numRows 0
rawDataSize 0
serialization.ddl struct dept { i32 deptno, string dname, string loc}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 80
transient_lastDdlTime 1515473970
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns deptno,dname,loc
columns.comments
columns.types int:string:string
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://hadoop:8020/user/hive/warehouse/dept
name default.dept
numFiles 1
numRows 0
rawDataSize 0
serialization.ddl struct dept { i32 deptno, string dname, string loc}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 80
transient_lastDdlTime 1515473970
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.dept
name: default.dept
hdfs://hadoop:8020/user/hive/warehouse/emp
Partition
base file name: emp
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns empno,ename,job,mgr,hiredate,sal,comm,deptno
columns.comments
columns.types int:string:string:int:string:double:double:int
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://hadoop:8020/user/hive/warehouse/emp
name default.emp
numFiles 1
numRows 0
rawDataSize 0
serialization.ddl struct emp { i32 empno, string ename, string job, i32 mgr, string hiredate, double sal, double comm, i32 deptno}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 820
transient_lastDdlTime 1515411106
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE true
bucket_count -1
columns empno,ename,job,mgr,hiredate,sal,comm,deptno
columns.comments
columns.types int:string:string:int:string:double:double:int
field.delim
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://hadoop:8020/user/hive/warehouse/emp
name default.emp
numFiles 1
numRows 0
rawDataSize 0
serialization.ddl struct emp { i32 empno, string ename, string job, i32 mgr, string hiredate, double sal, double comm, i32 deptno}
serialization.format
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 820
transient_lastDdlTime 1515411106
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.emp
name: default.emp
Truncated Path -> Alias:
/dept [d]
/emp [e]
Needs Tagging: true
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 deptno (type: int)
1 deptno (type: int)
outputColumnNames: _col0, _col1, _col7, _col12
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 0
directory: hdfs://hadoop:8020/tmp/hive/hadoop/ad410998-9c98-481e-9cb7-dd4262272f21/hive_2018-01-29_16-58-28_302_3010477556100147411-1/-mr-10000/.hive-staging_hive_2018-01-29_16-58-28_302_3010477556100147411-1/-ext-10001
NumFilesPerFileSink: 1
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: hdfs://hadoop:8020/tmp/hive/hadoop/ad410998-9c98-481e-9cb7-dd4262272f21/hive_2018-01-29_16-58-28_302_3010477556100147411-1/-mr-10000/.hive-staging_hive_2018-01-29_16-58-28_302_3010477556100147411-1/-ext-10001/
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
columns _col0,_col1,_col2,_col3
columns.types int:string:int:string
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.248 seconds, Fetched: 238 row(s)
- 第一阶段
- 在语法树中的TOK_FROM中我们可以看到这两张表及他们需要连接的条件
- 在TOK_INSERT中因为我们没有指定将查询信息写入哪张表,所以依旧为空
- 在TOK_SELECT中,显示了我们要查询的字段,逐个显示。
- 第二阶段
可以十分明显的看出依赖关系,stage-0依赖于stage-1 - 第三阶段
对每个stage都进行详细展示
- stage-1:
1.首先先将两张表分开来看,看第一张表 emp表
2. 需要查看表,对表的基本信息查表
3.Filter Operator
进行过滤操作将detpno为空的全部过滤掉
isSamplingPred: false
predicate: deptno is not null (type: boolean)
4.Reduce Output Operator
:展示要输入到reduce的字段主要有deptno、empno、ename。
key expressions: deptno (type: int)
sort order: +
Map-reduce partition columns: deptno (type: int)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
tag: 0
value expressions: empno (type: int), ename (type: string)
auto parallelism: false
5. 另外一张表也不尽相同,只不过所要传入reduce的字段不同:deptno、dname。
6. reduce阶段开始
7.Join Operator
:join的类型
condition map:
Inner Join 0 to 1
8.keys:
:连接字段
0 deptno (type: int)
1 deptno (type: int)
9.outputColumnNames: _col0, _col1, _col7, _col12
:输出字段,及字段的编号和类型。
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
10.input format: org.apache.hadoop.mapred.TextInputFormat
:输出的格式
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat - stage-0:主要做一些收尾作业