第11章 Hive:SQL on Hadoop
11.8 HQL:排序
11.8.1 order by
Hive 中的 order by与SQL 中的order by语义相同,会对查询结果进行全局排序,但是Hive 语句最终要转换为 MapReduce 程序放到 Hadoop 分布式集群上去执行,多个Mapper 后汇集到一个 Reducer 上执行,如果结果数据量大,那就会造成 Reduce 执行相当漫长。
hive> select * from emp
> order by hiredate;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814100601_114fbe7b-a71a-4575-8a85-582f83ead490
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0001, Tracking URL = http://node1:8088/proxy/application_1502717288397_0001/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502717288397_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:06:36,164 Stage-1 map = 0%, reduce = 0%
2017-08-14 10:06:57,150 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.01 sec
2017-08-14 10:07:13,160 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.46 sec
MapReduce Total cumulative CPU time: 7 seconds 460 msec
Ended Job = job_1502717288397_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.46 sec HDFS Read: 10295 HDFS Write: 821 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 460 msec
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 0.0 20
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 2975.0 0.0 20
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7782 CLARK MANAGER 7839 1981-06-09 2450.0 0.0 10
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7839 KING PRESIDENT 0 1981-11-17 5000.0 0.0 10
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
7902 FORD ANALYST 7566 1981-12-03 3000.0 0.0 20
7934 MILLER CLERK 7782 1982-01-23 1300.0 0.0 10
Time taken: 74.031 seconds, Fetched: 12 row(s)
hive>
hive> select * from emp
> where did=30
> order by sal desc,eid asc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814101221_ac53277e-41d7-42a7-aa11-3ba2fe041795
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0002, Tracking URL = http://node1:8088/proxy/application_1502717288397_0002/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1502717288397_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:12:45,371 Stage-1 map = 0%, reduce = 0%
2017-08-14 10:13:06,327 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.0 sec
2017-08-14 10:13:23,493 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.38 sec
MapReduce Total cumulative CPU time: 9 seconds 380 msec
Ended Job = job_1502717288397_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 9.38 sec HDFS Read: 10967 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 380 msec
OK
7698 BLAKE MANAGER 7839 1981-05-01 2850.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300.0 30
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0.0 30
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400.0 30
7900 JAMES CLERK 7698 1981-12-03 950.0 0.0 30
Time taken: 65.629 seconds, Fetched: 6 row(s)
hive>
请注意 , Hive中尽量不要用Order By, 除非非常确定结果集很小 。但是排序的需求总是有的,Hive 中使用下面的几种排序来满足需求。
11.8.2 sort by
sort by是在每个reduce中进行排序,是一个局部排序,可以保证每个 Reducer 中是按照指定字段进行排好序的,但是全局上来说,相同字段可以被分配到不同的Reduce 上,虽然在各个 Reduce 上是排好序的,但是全局上不一定是排好序的。
在执行Hive查询时,可以看到如下信息:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
- hive.exec.reducers.bytes.per.reducer
此参数从Hive 0.2.0开始引入。在Hive 0.14.0版本之前默认值是1G;而从Hive 0.14.0开始以后的版本,默认值变成了256M。 - mapred.reduce.tasks
如果设置了mapred.reduce.tasks,那么Hive会直接使用它的值作为Reduce的个数;
如果mapred.reduce.tasks的值没有设置(默认值就是-1),那么Hive会根据输入文件的大小估算出Reduce的个数。根据输入文件估算Reduce的个数可能未必很准确,因为Reduce的输入是Map的输出,而Map的输出可能会比输入要小,所以最准确的数根据Map的输出估算Reduce的个数。
[root@node3 ~]# cd /opt/hive-2.1.1/conf/
[root@node3 conf]# vi hive-site.xml
[root@node3 conf]# cat hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.local</name>
<!--本地存储,MySQL和hive安装在同一节点-->
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node3:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<!--mysql数据库用户名-->
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>
</configuration>
[root@node3 conf]#
hive> select eid,ename,sal from emp sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820064647_81b26e52-d158-4861-89f0-677ec0db0b52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0009, Tracking URL = http://node1:8088/proxy/application_1503220733636_0009/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1503220733636_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 06:47:02,958 Stage-1 map = 0%, reduce = 0%
2017-08-20 06:47:12,622 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.47 sec
2017-08-20 06:47:24,950 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1503220733636_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 6.35 sec HDFS Read: 12635 HDFS Write: 559 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
8888 HADRON NULL
7369 SMITH 800.0
7900 JAMES 950.0
7521 WARD 1250.0
7934 MILLER 1300.0
7844 TURNER 1500.0
7782 CLARK 2450.0
7698 BLAKE 2850.0
7566 JONES 2975.0
7654 MARTIN 1250.0
7499 ALLEN 1600.0
7902 FORD 3000.0
7839 KING 5000.0
Time taken: 39.092 seconds, Fetched: 13 row(s)
hive>
11.8.3 distribute by
distribute by 指定 map 输出结果怎么样划分后分配到各个 Reduce 上去,比如 distribute by cid,就可以保证 cid 字段相同的结果被分配到同一个 reduce 上去执行。然后再指定 sort by cid,则在 Reduce 上进行按照 cid 进行排序。但是这种还是不能做到全局排序,只能保证排序字段值相同的放在一起,并且在 reduce 上局部是排好序的。
需要注意的是 distribute by 必须写在 sort by 前面。
hive> select eid,ename,sal,did from emp distribute by did sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090113_637965de-369d-4a26-b233-0ee69513d6f3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0014, Tracking URL = http://node1:8088/proxy/application_1503220733636_0014/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1503220733636_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:01:36,959 Stage-1 map = 0%, reduce = 0%
2017-08-20 09:01:57,587 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.93 sec
2017-08-20 09:02:14,251 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.04 sec
MapReduce Total cumulative CPU time: 9 seconds 40 msec
Ended Job = job_1503220733636_0014
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 9.04 sec HDFS Read: 13200 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 40 msec
OK
8888 HADRON NULL NULL
7369 SMITH 800.0 20
7900 JAMES 950.0 30
7654 MARTIN 1250.0 30
7521 WARD 1250.0 30
7934 MILLER 1300.0 10
7844 TURNER 1500.0 30
7499 ALLEN 1600.0 30
7782 CLARK 2450.0 10
7698 BLAKE 2850.0 30
7566 JONES 2975.0 20
7902 FORD 3000.0 20
7839 KING 5000.0 10
Time taken: 61.594 seconds, Fetched: 13 row(s)
hive>
11.8.4 cluster by
如果 distribute by 和 sort by 的字段是同一个,可以简写为cluster by
hive> select eid,ename,sal,did from emp cluster by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090813_f764950e-bcdd-4100-9554-2fc8a9c2bde1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0015, Tracking URL = http://node1:8088/proxy/application_1503220733636_0015/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job -kill job_1503220733636_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:08:28,249 Stage-1 map = 0%, reduce = 0%
2017-08-20 09:08:37,776 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.92 sec
2017-08-20 09:08:49,509 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.1 sec
MapReduce Total cumulative CPU time: 7 seconds 100 msec
Ended Job = job_1503220733636_0015
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 7.1 sec HDFS Read: 13186 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 100 msec
OK
8888 HADRON NULL NULL
7369 SMITH 800.0 20
7900 JAMES 950.0 30
7654 MARTIN 1250.0 30
7521 WARD 1250.0 30
7934 MILLER 1300.0 10
7844 TURNER 1500.0 30
7499 ALLEN 1600.0 30
7782 CLARK 2450.0 10
7698 BLAKE 2850.0 30
7566 JONES 2975.0 20
7902 FORD 3000.0 20
7839 KING 5000.0 10
Time taken: 37.126 seconds, Fetched: 13 row(s)
hive>