有很多刚学习的小伙伴,想学习参数优化,又不知道有哪些参数,如何优化。

给大家,说明思路。

1.set  你知道在这里找,说明你知道实际生效的作用

2.hive-site.default.xml,你知道在这里找,说明你知道这个配置文件的作用(cdp-hive3.1的部分配置hive官网都没有。。在clouder的官网)

3.官网。官网永远是学习最好的地方,不用多说

Configuration Properties - Apache Hive - Apache Software Foundation

下面再详细说下各个参数的优化及实操过程。

一、分区裁剪,列裁剪

思想:就是减少读取的数据量。

比如a 1000w b 1000w 这两个表join耗时,但是b过滤只有1w了这个时候join就很简单了。

比如 a 有100个字段 b有100个字段,其实我们只需要a b的几个字段,如果还是用*那么导致读取的数据量很多

关于分区裁剪 其实就是谓词下推,目前hive版本都支持的,就是把where条件推到下游。

二、count(distinct) 和group by

这个很多人都说的都是distinct只有一个reduce 会把所有数据放到一个reduce导致处理的很慢?

那么为什么只有一个reduce呢?有没有人想过,我不是可以设置reduce个数 reduce处理量么?

我看到一个说法是:因为distinct 内部会给所有数据进行hash 然后排序,这样只能在一个reduce里。

那么group by 呢? 这个是可以手动控制map数和reduce个数的。

同时注意参数

map端预聚合 理解为combiner

set hive.map.aggr=true --Whether to use map-side aggregation in Hive Group By queries

map端聚合的数目条数

set hive.groupby.mapaggr.checkinterval=100000--Number of rows after which size of the grouping keys/aggregation classes is performed

倾斜处理 类似与加盐打散 a 变成 a1 a2 a3 ...a9 然后合并为a

set hive.groupby.skewindata=true --Whether there is skew in data to optimize group by queries

三、分区 分桶

这个目的也是为了将文件切分。读取的文件数量少,自然速度就快

这个懒得说了,一般来说数据量多一点就可以分区,然后特别多的话可以分桶

四、并行

set hive.exec.parallel=true; //默认关闭,可以开启并发执行

set hive.exec.parallel.thread.number=8; //同一个sql允许最大并行度,默认为8

说下这两个参数,顾名思义并行处理任务,大部分情况下能够加快数据处理时间,特别是多个表join ,例如 (a join b) ab  join (c join d) cd   on ab.key=cd.key。

但是注意 这个并行开启了,客户端时看不了日志的,例如stage task成功了几个失败了几个,

而且我由此遇到过一个任务不开并行能够跑成功,开了跑了一半必失败。。。

而且开并行对资源的耗费也比较大,看情况慎用

五、mapjoin

可以看到大多数文章都逃不过mapjoin这个说法,那么mapjoin是什么呢?其实就是将这个表的数据分发到各个节点或者map任务中。

注意这句话!!mapjoin 无shuffle 无reduce!!可能很多人见过这句话,但是不知道话会有什么作用,比如有的用了mapjoin还想调整reduce的个数?门都没有。

set hive.auto.convert.join=true; //false in 0.7.0 to 0.10.0; true in 0.11.0 and later
set hive.auto.convert.join.noconditionaltask=true; 
set hive.auto.convert.join.noconditionaltask.size=10000000; // 10M
set hive.mapjoin.smalltable.filesize=25000000; //小表阈值25M

--关于 noconditionaltask

Whether Hive enables the optimization about converting common join into mapjoin based on the input file size. \n" + "If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the\n" + "specified size, the join is directly converted to a mapjoin (there is no conditional task).查了下其他人的翻译,各有千秋这个翻译靠谱点。Hive在基于输入文件大小的前提下将普通JOIN转换成MapJoin,并是否将多个MJ合并成一个,多个MJ合并成一个MJ时,其表的总的大小须小于该值,同时hive.auto.convert.join.noconditionaltask必须为true

小表 odsiadata.label_dict -- 29532=28.8 K  cnt=596

大表 odsiadata.cooperation_label_entity -- 310M   cnt=4,744,974

那么怎么知道mapjoin起作用了吗?

explain 
select a.*,b.*
from odsiadata.cooperation_label_entity a  
join odsiadata.label_dict b 
on a.label_key=b.label_key;

采用默认参数 即 auto.mapjoin=true 和 .smalltable.filesize=25M

结果是mapjoin 大表left join 小表也会自动mapjoin

explain 
select a.*,b.*
from odsiadata.label_dict a  
left join odsiadata.cooperation_label_entity b 
on a.label_key=b.label_key
采用默认参数,注意这里 小表作为主表,map join失效 会有reduce

那如果我现在就不想让他mapjoin呢?或者说我怎么知道哪个参数真正起作用了?

上图有个hive.auto.convert.join.noconditionaltask.size=10000000=10M

还有个hive.mapjoin.smalltable.filesize=25000000=25M

如果我的小表大小是20M 是走mapjoin还是reduce呢?这个问题你们可以问下那些博主有几个能答出来。

然后我们这个表的大小到底是多少呢?

hive jvm调优 hive调优参数_大数据

 可以看到

大表=325492973 =310M

小表=29532=28.8k

那么上面的两个参数就真的10M 25M指的是29532吗? 有人真的研究过吗?我们还是认真研究下explain把。

alias: a  Statistics: Num rows: 4744974 Data size: 5863719621 Basic stats: COMPLETE Column stats: PARTIAL

a是大表 numrows和我们的count对的上,但是这个datasize是啥?差不多5.59G了

alias: b  Statistics: Num rows: 596 Data size: 328294 Basic stats: COMPLETE Column stats: COMPLETE

b是小表 numrows和我们的count也对的上,这个datasize好像也有点对不上

这里我随便猜猜,怀疑是orc文件自带的压缩,然后解压后的大小?随便说说。反正注意这个数字了!!!

然后来验证。小表开始mapjoin的这个阈值到底是啥?

hive.auto.convert.join.noconditionaltask.size=328293 --小于328294

hive.mapjoin.smalltable.filesize=25000000

采用reduce。

hive.auto.convert.join.noconditionaltask.size=328294 --等于328294

hive.mapjoin.smalltable.filesize=25000000

采用mapjoin,这说明了啥?真正起作用的是explain的参数。

 然后我突然有对filesize的25M感兴趣了,这个作用又是啥。

测试结果居然是。。。hive.mapjoin.smalltable.filesize=1 仍然走mapjoin。这时候我不得不怀疑csdn上这么多的文章难道都是抄袭的?

其实我到这里已经有了一个大胆的猜想,对他们都是抄袭的,抄nm哟!!!!

这个参数本身没有错,但是用错了位置!!!因为我已经吃了csdn很多亏了。

注意注意。注意你们用的什么引擎!!!!

都2022年了,大家都是spark tez了。有些b还活在很多年前,还是mr的那一套优化,都过时了老哥。

set hive.execution.engine=mr 来测试这个smallfile.

结果....发现smalltable.filesize卵用没有,如果有用的话请求下博客地址我去学习点赞收藏。

set hive.execution.engine=mr

set hive.mapjoin.smalltable.filesize=

set hive.auto.convert.join.noconditionaltask.size=20971520

走的是mapjoin 说明啥 这个filesize卵用没有。

set hive.execution.engine=mr

set hive.mapjoin.smalltable.filesize=

set hive.auto.convert.join.noconditionaltask.size=328294

走的是mapjoin

set hive.auto.convert.join.noconditionaltask.size=328293 --这个小于explain的size

走的是mapjoin

set hive.auto.convert.join.noconditionaltask.size=29532 --原始大小29532

走的mapjoin


set hive.auto.convert.join.noconditionaltask.size=29531

走的reduce

说明了啥mr 和spark对这个size的是不一样的。

我们来总结下。

engine=spark的时候,大表关联小表,我们要注意小表explain的大小,而不是实际大小 

engine=mr的时候,大表关联小表,我们要注意小表的实际大小也就是hdfs dfs -du 的大小

特别注意 大表关联小表 如果想使用mapjoin 是不能 小表left join 大表的!!!

hive jvm调优 hive调优参数_sed_02

 最后附上nocondtionaltask 哪个解释,我在源码找到这里,感觉会稍微清晰点。maxsize就是。

hive jvm调优 hive调优参数_hive jvm调优_03

因能力有限不能完全研读hive源码,这里不能给出明确的答案,但是告诉大家网上的文章很多都有错误,都需要自己实践。

2022-09-05更新map join 之前就说过了那个smallsize 没啥卵用。但是看不懂源码,也没找到官方文档。结果今天我找到了。LanguageManual JoinOptimization - Apache Hive - Apache Software Foundation

Optimize Auto Join Conversion

When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters:

这里说的是如何优化convert.join。官网给出的就是下面两个参数,并没有提到smallsize

set hive.auto.convert.join.noconditionaltask = true; set hive.auto.convert.join.noconditionaltask.size = 10000000;

The default for hive.auto.convert.join.noconditionaltask is true which means auto conversion is enabled. (Originally the default was false – see HIVE-3784 – but it was changed to true by HIVE-4146 before Hive 0.11.0 was released.)

The size configuration enables the user to control what size table can fit in memory. This value represents the sum of the sizes of tables that can be converted to hashmaps that fit in memory. Currently, n-1 tables of the join have to fit in memory for the map-join optimization to take effect. There is no check to see if the table is a compressed one or not and what the potential size of the table can be. The effect of this assumption on the results is discussed in the next section.

For example, the previous query just becomes:

select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk) join date_dim on (ss_sold_date_sk = d_date_sk) where t_hour = 8 and d_year = 2002

If time_dim and date_dim fit in the size configuration provided, the respective joins are converted to map-joins. If the sum of the sizes of the tables can fit in the configured size, then the two map-joins are combined resulting in a single map-join.

就是 date 8M time 8M 就会搞两个mapjoin,

date4M time 4M  4+4<10就会搞一个mapjoin

This reduces the number of MR-jobs required and significantly boosts the speed of execution of this query. This example can be easily extended for multi-way joins as well and will work as expected.

Outer joins offer more challenges. Since a map-join operator can only stream one table, the streamed table needs to be the one from which all of the rows are required. For the left outer join, this is the table on the left side of the join; for the right outer join, the table on the right side, etc. This means that even though an inner join can be converted to a map-join, an outer join cannot be converted. An outer join can only be converted if the table(s) apart from the one that needs to be streamed can be fit in the size configuration. A full outer join cannot be converted to a map-join at all since both tables need to be streamed.

Current Optimization

  1. Group as many MJ operators as possible into one MJ.

As Hive goes through the conversion to map-joins for join operators based on the configuration flags, an effort is made at the end of these conversions to group as many together as possible. Going through in a sequence, if the sum of the sizes of the tables participating in the individual map-join operators is within the limit configured by the noConditionalTask.size flag, these MJ operators are combined together. This ensures more speedup with regard to these queries.

这里说的就是 合并为一个mapjoin 是比多个mapjoin 效率高的

 6、SMBJOIN

因为我这里没怎么用过分桶表不好测试。看看文档即可。

Auto Conversion to SMB Map Join

Sort-Merge-Bucket (SMB) joins can be converted to SMB map joins as well. SMB joins are used wherever the tables are sorted and bucketed. The join boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary map-join. However, if the tables are partitioned, there could be a slow down as each mapper would need to get a very small chunk of a partition which has a single key.

The following configuration settings enable the conversion of an SMB to a map-join SMB:

set hive.auto.convert.sortmerge.join=true; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true;

There is an option to set the big table selection policy using the following configuration:

set hive.auto.convert.sortmerge.join.bigtable.selection.policy = org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

By default, the selection policy is average partition size. The big table selection policy helps determine which table to choose for only streaming, as compared to hashing and streaming.

The available selection policies are:

org.apache.hadoop.hive.ql.optimizer.AvgPartitionSizeBasedBigTableSelectorForAutoSMJ (default) org.apache.hadoop.hive.ql.optimizer.LeftmostBigTableSelectorForAutoSMJ org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ

The names describe their uses. This is especially useful for the fact-fact join (query 82 in the TPC DS benchmark).

SMB Join across Tables with Different Keys

If the tables have differing number of keys, for example Table A has 2 SORT columns and Table B has 1 SORT column, then you might get an index out of bounds exception.

The following query results in an index out of bounds exception because emp_person let us say for example has 1 sort column while emp_pay_history has 2 sort columns.

Error Hive 0.11

SELECT p.*, py.*

FROM emp_person p INNER JOIN emp_pay_history py

ON   p.empid = py.empid

This works fine.

Working query Hive 0.11

SELECT p.*, py.*

FROM emp_pay_history py INNER JOIN emp_person p

ON   p.empid = py.empid

 ———————————————后续慢慢写,尽量保证测试了发出来———————————