I ran into the following problem:

hive> desc ljn001;
OK
name    string
value  int
hive> select * from ljn001;
OK
wang5  92
zhang3  87
li4    73
I then wanted to use a UDF-based row_number analytic function to find the name with the smallest value. If you are not familiar with how row_number can be implemented as a UDF, or with how to use it, see: http://www.linuxidc.com/Linux/2013-11/92554.htm
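For context, here is a minimal sketch of what such a stateful row_number UDF typically looks like (my own illustration, not necessarily the exact code from the linked article):

import org.apache.hadoop.hive.ql.exec.UDF;

public class RowNumber extends UDF {
    private long rowNum = 0;          // counter kept inside a single map/reduce task
    private Integer lastKey = null;   // "partition key" seen on the previous row

    // Called once per row; the argument acts as a partition key.
    // The queries below always pass the constant 1, i.e. one global partition.
    public long evaluate(Integer partitionKey) {
        if (lastKey == null || !lastKey.equals(partitionKey)) {
            rowNum = 0;               // new partition: restart the numbering
            lastKey = partitionKey;
        }
        return ++rowNum;
    }
}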
The query SQL is as follows:
select name
        from
        (select name,row_number(1) as rn
                  from
                  (select *
                            from ljn001 order by value
                  ) a
          ) a
 where rn = 1;
But the result was unexpected:
OK
wang5
It actually returned the name with the largest value! Let's look again without filtering on rn:
select name,rn
        from
        (select name,row_number(1) as rn
                  from
                  (select *
                            from ljn001 order by value
                  ) a
          ) a;
OK
li4    1
zhang3  2
wang5  3
li4 is clearly the one with rn = 1, so why does filtering on rn = 1 return wang5?
A look at the execution plan makes it clear:
explain select name
        from
        (select name,row_number(1) as rn
                  from
                  (select *
                            from ljn001 order by value
                  ) a
          ) a
 where rn = 1;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
    Alias -> Map Operator Tree:
        a:a:ljn001
          TableScan
            alias: ljn001
            Filter Operator
              predicate:
                  expr: (row_number(1) = 1)
                  type: boolean
              Select Operator
                expressions:
                      expr: name
                      type: string
                      expr: value
                      type: int
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col1
                        type: int
                  sort order: +
                  tag: -1
                  value expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: int
      Reduce Operator Tree:
        Extract
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: row_number(1)
                  type: bigint
            outputColumnNames: _col0, _col1
            Filter Operator
              predicate:
                  expr: (_col1 = 1)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  Stage: Stage-0
    Fetch Operator
      limit: -1

The execution plan is a bit long; the key part is the TableScan of ljn001 under the Map Operator Tree in Stage-1. While the ljn001 table is being scanned, there is already a Filter Operator with the predicate: expr: (row_number(1) = 1)

This is the root of the problem. When generating the execution plan, Hive sees that row_number is called with the constant 1 and assumes its result must also be a constant, so instead of evaluating the filter in the outer query it pushes the filter down into the inner table scan. This optimization is called predicate pushdown. Because the filter now runs before the sort, the ORDER BY has no effect on row_number, and the wrong row (wang5) is returned.

After some analysis, I found that the problem was in how I had written the row_number function. As in traditional databases, Hive UDFs are divided into deterministic and non-deterministic functions. A deterministic function always produces the same output for the same input, while a non-deterministic function may produce different outputs for the same input, e.g. rand(). Most of Hive's built-in UDFs are deterministic; so far I have only found rand and unix_timestamp to be non-deterministic. User-defined UDFs extend either the UDF base class or the GenericUDF base class, both of which are deterministic by default. Therefore, when writing a non-deterministic analytic function such as row_number, we must explicitly mark it as non-deterministic by annotating the class with @UDFType(deterministic = false) and importing org.apache.hadoop.hive.ql.udf.UDFType.
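Applied to the sketch above (again, only an illustration of where the annotation goes, not the original code):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// deterministic = false tells the optimizer that row_number(1) is not a
// constant, so the rn = 1 predicate can no longer be pushed below the sort.
@UDFType(deterministic = false)
public class RowNumber extends UDF {
    // ... same evaluate() logic as in the sketch above ...
}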

After fixing the row_number function, check the execution plan again:

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
    Alias -> Map Operator Tree:
        a:a:ljn001
          TableScan
            alias: ljn001
            Select Operator
              expressions:
                    expr: name
                    type: string
                    expr: value
                    type: int
              outputColumnNames: _col0, _col1
              Reduce Output Operator
                key expressions:
                      expr: _col1
                      type: int
                sort order: -
                tag: -1
                value expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: int
      Reduce Operator Tree:
        Extract
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: row_number(1)
                  type: int
            outputColumnNames: _col0, _col1
            Filter Operator
              predicate:
                  expr: (_col1 = 1)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  Stage: Stage-0
    Fetch Operator
      limit: -1

This time the Filter Operator has been moved into the Reduce Operator Tree, i.e. no predicate pushdown takes place.

Note: I found that in the latest versions of Hive, once the optimizer encounters distribute by, sort by, or order by, it does not push predicates down for any function at all, whether deterministic or non-deterministic. Personally, I don't think that approach is optimal.