你可以不够优秀,但是不要甘于平凡。-- 阿尼古

Hive中的四种排序

order by 、 sort by 、 distribute by 、cluster by

首先贴一下官网对于Hive排序的几种方法的介绍:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

总述

order by

  • 全局排序;
  • 对输入的数据做排序,故此只有一个reducer(多个reducer无法保证全局有序);
  • 只有一个reducer,会导致当输入规模较大时,需要较长的计算时间;

总结:order by会对输入做全局排序,因此只有一个Reducer(多个Reducer无法保证全局有序),然而只有一个Reducer,会导致当输入规模较大时,消耗较长的计算时间。

sort by

  • 非全局排序;
  • 在数据进入reducer前完成排序;
  • 当mapred.reduce.tasks>1时,只能保证每个reducer的输出有序,不保证全局有序;
  • 可以通过设置参数(set mapred.reduce.tasks=n)来指定reducer的个数

总结:sort by不是全局排序,其在数据进入reducer前完成排序,因此,如果用sort by进行排序,并且设置mapred.reduce.tasks>1,则sort by只会保证每个reducer的输出有序,并不保证全局有序。sort by不同于order by,它不受Hive.mapred.mode属性的影响,sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序。使用sort by你可以指定执行的reduce个数(通过set mapred.reduce.tasks=n来指定),对输出的数据再执行归并排序,即可得到全部结果。

distribute by

  • 按照指定的字段对数据进行划分输出到不同的reduce中;
  • 常和sort by一起使用,并且distribute by必须在sort by前面;

总结:distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列,对应reduce的个数进行分发,默认是采用hash算法。sort by为每个reduce产生一个排序文件。在有些情况下,你需要控制某个特定行应该到哪个reducer,这通常是为了进行后续的聚集操作。distribute by刚好可以做这件事。因此,distribute by经常和sort by配合使用。

cluster by

  • 相当于distribute by+sort by,只能默认升序,不能使用倒序;

总结:cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是倒叙排序,不能指定排序规则为ASC或
者DESC。

如果你觉得上面已经能满足你的需求,那你可以不往下面看了。。。

语法详细分析

  • Hhh… 今天一起看英文好不好???哈哈哈…

Order by语法

Hive QL中的ORDER BY语法类似于SQL语言中的ORDER BY语法

英文解释
colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST)           -- (Note: Available in Hive 2.1.0 and later)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy

There are some limitations in the “order by” clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a “limit” clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.

Note that columns are specified by name, not by position number. However in Hive 0.11.0 and later, columns can be specified by position when configured as follows:

The default sorting order is ascending (ASC).

In Hive 2.1.0 and later, specifying the null sorting order for each of the columns in the “order by” clause is supported. The default null sorting order for ASC order is NULLS FIRST, while the default null sorting order for DESC order is NULLS LAST.

In Hive 3.0.0 and later, order by without limit in subqueries and views will be removed by the optimizer. To disable it, set hive.remove.orderby.in.subquery to false.

中文解释

最上层的语法就不bb了。

首先Hive分为严格模式和非严格模式的。

在严格模式下,order by后面必须要指定limit才可以运行,如果不想加limit就设置为非严格模式;原因是什么呢?order by会去加载全部数据,并在一个reducer里面进行排序然后输出,如果输出的行数非常大的话,那么单个reducer节点会花费很长一段时间去完成这个人任务。

最下面就是版本的变更操作,有一下几点不再叙述版本

  • 可以使用非列名方式而通过select后面的列的位置进行order by指定操作
  • 如果有null值的order by操作,asc升序null在最前面,desc降序nuull在最后面
  • order by取消limit操作 配置参数,默认为false

Sort by 语法

英文解释

The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.

In Hive 3.0.0 and later, sort by without limit in subqueries and views will be removed by the optimizer. To disable it, set hive.remove.orderby.in.subquery to false.

中文解释

最上层语法内容不多bb。

sort by 是在进入reducer之前进行排序的,也就是说每个reducer里面的数据是有序的,但是不能保证全局有序。

排序规则根据字段类型来确定,string、number分别对应不同的排序规则

最下面就是版本的变更操作,有一下几点不再叙述版本

  • sort by 取消limit操作 配置参数,默认为false
  • 使用sort by 可以通过set mapred.reduce.tasks=n 来指定reducer 的个数
语法效果
SELECT key, value FROM src SORT BY key ASC, value DESC;

-- 假设有两个reducer,输出分别如下:
reducer-1:
0   5
0   3
3   6
9   1

reducer-2:
0   4
0   3
1   1
2   5

Distribute By语法

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
英文解释

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, we are Distributing By x on the following 5 rows to 2 reducer:

x1
x2
x4
x3
x1

Reducer-1

x1
x2
x1

Reducer-2

x4
x3
中文解释

Distribute by 后面指定字段是对字段进行分发也即是分区的操作,不会进行什么排序操作,通常会和sort by 联用。默认的分区方式是列的hash值来进行分区de。

常用写法如下
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
			||
SELECT col1, col2 FROM t1 order BY col1 ASC, col2 DESC
sql性能分析

distribute by + sort by 底层采用先分区,然后再将分配好的每一个reducer内部的数据进行排序,近而达到全局排序的效果,大大的优化了sql的性能。然而order by是通过一个reducer进行全局排序的,性能较差。

Cluster by语法

SELECT col1, col2 FROM t1 CLUSTER BY col1
英文解释

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Note that all rows with the same key x1 is guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.

In contrast, if we use Cluster By x, the two reducers will further sort rows on x:

Reducer-1

x1
x1
x2

Reducer-2

x3
x4
中文解释

Cluster by 狭义上讲 约等于 distribute by + sort by

  • cluster by 仅支持asc升序,并且不支持指定排序方式