order by

全局排序,hql转换后的mr左右只有一个reduce任务。当数据量比较大时order by就要慎用,很有可能
导致reduce需要较长的时间才能完成,或者完不成。
格式: order by 字段名 [asc|desc]
默认是asc 升序,desc表示降序
位置:
order by语句通常防止hql语句的最后。

sort by

sort by作用:在每一个reduce task任务内部排序,在大量数据集时使用order by存在着效率低下的问
题,很多场景中并不需要全局排序。
每个reduce任务都会对应的结果文件part-r-xxxxxx,在每一个结果文件中都是有序的,全局是无序的。
通过set命令设置reduce任务的数量,有效期是直到下次修改该参数的值或hive连接关闭:

hive> set mapreduce.job.reduces=3;
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=3
hive> select * from emp sort by deptno desc;
Query ID = root_20220119212303_71bc2d21-8eaf-488d-8c84-29ae359e4797
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3

结果不够直观,将之后的结果文件下载到本地。

hive> insert overwrite local directory '/opt/sortbyresult' select * from emp sort by deptno desc;
[root@node4 sortbyresult]# ll
total 12
-rw-r--r-- 1 root root 285 Jan 19 21:27 000000_0
-rw-r--r-- 1 root root 290 Jan 19 21:27 000001_0
-rw-r--r-- 1 root root 95 Jan 19 21:27 000002_0

distribute by

distribute by:对应MR作业的partition(自定义分区),通常结合sort by一起使用。在某些情况下需
要控制特定的行应该到哪个reduce任务中,为了后续的聚合操作。分区有对应reduce任务,有几个分区
就有几个reduce任务;否则就看不到distribute by的效果。
distribute by分区规则是根据分区字段的hash值与分区数(reduce任务的总数)进行除模后,余数相同
的分到一个分区中。
要求:distribute by语句写在sort by语句的前面。

hive> set mapreduce.job.reduces=4;
hive> insert overwrite local directory '/opt/distributebyresult' select * from emp distribute by deptno sort by empno desc;
Query ID = root_20220119223319_43617d31-8921-4707-9bec-28de8d3a3bd4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1642581300381_0010, Tracking URL = http://node4:8088/proxy/application_1642581300381_0010/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job -kill job_1642581300381_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2022-01-19 22:33:45,351 Stage-1 map = 0%, reduce = 0%
2022-01-19 22:33:57,928 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.84 sec
2022-01-19 22:34:08,730 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 6.77 sec
2022-01-19 22:34:27,183 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 12.53 sec
2022-01-19 22:34:45,360 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 14.34 sec
2022-01-19 22:34:59,227 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.54 sec
MapReduce Total cumulative CPU time: 16 seconds 540 msec
Ended Job = job_1642581300381_0010
Copying data to local directory /opt/distributebyresult
Copying data to local directory /opt/distributebyresult
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 16.54 sec HDFS Read: 19263 HDFS Write: 670 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 540 msec
OK
Time taken: 105.24 seconds
[root@node4 distributebyresult]# ll
total 8
-rw-r--r-- 1 root root 516 Jan 19 22:35 000000_0
-rw-r--r-- 1 root root 0 Jan 19 22:35 000001_1
-rw-r--r-- 1 root root 154 Jan 19 22:35 000002_0
-rw-r--r-- 1 root root 0 Jan 19 22:35 000003_1
[root@node4 distributebyresult]# cat -A 000000_0
7934^Amiller^Aclerk^A7782^A1982-1-23^A1300.0^A10.0^A\N$
7902^Aford^Aanalyst^A7566^A1981-12-3^A3000.0^A20.0^A\N$
7900^Ajames^Aclerk^A7698^A1981-12-3^A950.0^A30.0^A\N$
7876^Aadams^Aclerk^A7788^A1987-5-23^A1100.0^A20.0^A\N$
7844^Aturner^Asalesman^A7698^A1981-9-8^A\N^A\N^A\N$
7839^Aking^Apresident^A\N^A5000.00 10^A\N^A\N^A\N$
7788^Ascott^Aanalyst^A7566^A1987-4-19^A3000.0^A20.0^A\N$
7782^Aclark^Amanager^A7839^A1981-6-9^A2450.0^A10.0^A\N$
7698^Ablake^Amanager^A7839^A1981-5-1^A2850.0^A30.0^A\N$
7566^Ajones^Amanager^A7839^A1981-4-2^A2975.0^A20.0^A\N$
7369^Asmith^Aclerk^A7902^A1980-12-17^A800.0^A20.0^A\N$
[root@node4 distributebyresult]# cat -A 000001_1
[root@node4 distributebyresult]# cat -A 000003_1
[root@node4 distributebyresult]# cat -A 000002_0
7654^Amartin^Asalesman^A7698^A1981-9-28^A1250.0^A1400.0^A30$
7521^Award^Asalesman^A7698^A1981-2-22^A1250.0^A500.0^A30$
7499^Aallen^Asalesman^A7698^A1981-2-20^A1600.0^A300.0^A30$

cluster by

当distribute by和sort by后面的字段相同时,可以使用cluster by进行简化。功能是等价的;但是只能使
用升序排序,不能指定排序规则为asc或者desc。

hive>select * from emp distribute by deptno sort by deptno;
#可以简化为
hive>select * from emp cluster by deptno;
hive> insert overwrite local directory '/opt/clusterbyresult' select * from emp cluster by deptno;
Query ID = root_20220119224138_5328d619-8a1b-4062-9eaa-007ba3846a72
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1642581300381_0011, Tracking URL = http://node4:8088/proxy/application_1642581300381_0011/
Kill Command = /opt/hadoop-2.6.5/bin/hadoop job -kill job_1642581300381_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2022-01-19 22:41:46,757 Stage-1 map = 0%, reduce = 0%
2022-01-19 22:42:06,745 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 11.66 sec
2022-01-19 22:42:16,727 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 15.04 sec
2022-01-19 22:42:33,601 Stage-1 map = 100%, reduce = 42%, Cumulative CPU 24.61 sec
2022-01-19 22:42:35,917 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 25.92 sec
2022-01-19 22:42:50,674 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 29.13 sec
2022-01-19 22:42:53,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 34.92 sec
MapReduce Total cumulative CPU time: 34 seconds 920 msec
Ended Job = job_1642581300381_0011
Copying data to local directory /opt/clusterbyresult
Copying data to local directory /opt/clusterbyresult
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 34.92 sec HDFS Read: 19271 HDFS Write: 670 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 920 msec
OK
Time taken: 78.887 seconds
[root@node4 clusterbyresult]# ll
total 8
-rw-r--r-- 1 root root 516 Jan 19 22:42 000000_1
-rw-r--r-- 1 root root 0 Jan 19 22:42 000001_0
-rw-r--r-- 1 root root 154 Jan 19 22:42 000002_0
-rw-r--r-- 1 root root 0 Jan 19 22:42 000003_0
[root@node4 clusterbyresult]# cat -A 000000_1
7934^Amiller^Aclerk^A7782^A1982-1-23^A1300.0^A10.0^A\N$
7902^Aford^Aanalyst^A7566^A1981-12-3^A3000.0^A20.0^A\N$
7900^Ajames^Aclerk^A7698^A1981-12-3^A950.0^A30.0^A\N$
7876^Aadams^Aclerk^A7788^A1987-5-23^A1100.0^A20.0^A\N$
7844^Aturner^Asalesman^A7698^A1981-9-8^A\N^A\N^A\N$
7839^Aking^Apresident^A\N^A5000.00 10^A\N^A\N^A\N$
7788^Ascott^Aanalyst^A7566^A1987-4-19^A3000.0^A20.0^A\N$
7782^Aclark^Amanager^A7839^A1981-6-9^A2450.0^A10.0^A\N$
7698^Ablake^Amanager^A7839^A1981-5-1^A2850.0^A30.0^A\N$
7566^Ajones^Amanager^A7839^A1981-4-2^A2975.0^A20.0^A\N$
7369^Asmith^Aclerk^A7902^A1980-12-17^A800.0^A20.0^A\N$
[root@node4 clusterbyresult]# cat -A 000001_0
[root@node4 clusterbyresult]# cat -A 000002_0
7521^Award^Asalesman^A7698^A1981-2-22^A1250.0^A500.0^A30$
7654^Amartin^Asalesman^A7698^A1981-9-28^A1250.0^A1400.0^A30$
7499^Aallen^Asalesman^A7698^A1981-2-20^A1600.0^A300.0^A30$
[root@node4 clusterbyresult]# cat -A 000003_0
[root@node4 clusterbyresult]#