1. Adjusting the number of reducers (method 1)
-- Data volume processed by each reducer (default 256 MB)
set hive.exec.reducers.bytes.per.reducer=256000000;
-- Maximum number of reducers allowed per job
set hive.exec.reducers.max=1009;
-- Formula for computing the number of reducers
number of reducers = min(hive.exec.reducers.max, ceil(total input size / hive.exec.reducers.bytes.per.reducer))
Note: this formula only takes effect when mapreduce.job.reduces=-1 (i.e., no explicit reducer count is set)
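For example, plugging in the Test 1 settings below (one 34.8 MB file, 20 MB per reducer, a 1009-reducer cap):
    number of reducers = min(1009, ceil(34.8 MB / 20 MB)) = min(1009, 2) = 2
which matches the "number of reducers: 2" reported by the job.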
Test 1: 1 file, 34.8 MB in size, 20 MB of data per reducer

set mapreduce.job.reduces=-1;                       -- let Hive derive the reducer count from input size
set hive.exec.reducers.bytes.per.reducer=20971520;  -- 20 MB per reducer
set hive.merge.mapfiles=true;                       -- merge small files at the end of a map-only job
set hive.merge.mapredfiles=true;                    -- merge small files at the end of a map-reduce job
set hive.merge.size.per.task=100;                   -- target size (bytes) of the merged files
set yarn.scheduler.maximum-allocation-mb=118784;    -- max memory YARN may allocate per container
set mapreduce.map.memory.mb=4096;                   -- memory per map task
set mapreduce.reduce.memory.mb=4096;                -- memory per reduce task
set yarn.nodemanager.vmem-pmem-ratio=4.2;           -- virtual-to-physical memory ratio

create table mergeTab4 as
select
substr(uploader,0,1)
,count(1)
from gulivideo_user_ori
group by substr(uploader,0,1);

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
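To see the files the job actually wrote, you can list the table's directory from the Hive CLI. The path below is an assumption, derived from the home.db warehouse paths shown for mergetab7 in Test 2:

-- hypothetical path, modeled on the Test 2 output paths
dfs -ls /user/hive/warehouse/home.db/mergetab4/;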

 

2. Adjusting the number of reducers (method 2)
-- Directly set the number of reduce tasks per job; a positive value overrides the size-based estimate from method 1
-- Can also be configured in mapred-default.xml
set mapreduce.job.reduces=3;
Test 2:
set mapreduce.job.reduces=3;                      -- fix the reducer count at 3
set yarn.scheduler.maximum-allocation-mb=118784;  -- memory settings identical to Test 1
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set yarn.nodemanager.vmem-pmem-ratio=4.2;

create table mergeTab7 as
select
substr(uploader,0,1)
,count(1)
from gulivideo_user_ori
group by substr(uploader,0,1);

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
162  486  /user/hive/warehouse/home.db/mergetab7/000000_0
157  471  /user/hive/warehouse/home.db/mergetab7/000001_0
163  489  /user/hive/warehouse/home.db/mergetab7/000002_0
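As expected, the explicit count set in Test 2 took precedence over the size-based estimate from method 1. To hand control back to that formula, reset the parameter; a minimal sketch:

set mapreduce.job.reduces=-1;  -- -1 lets Hive estimate the count from input size again
set mapreduce.job.reduces;     -- in the Hive CLI, `set <name>;` with no value prints the current setting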
3. Question: is a larger number of reducers always better?
    No.
        1. If too many reduce tasks are configured, their startup and initialization time far exceeds each task's actual processing time, wasting both resources and time.
        2. Too many reduce tasks also produce too many small output files (see the merge settings sketched below).
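The small-file side effect can be mitigated with the merge settings already used in Test 1. A minimal sketch; the thresholds are illustrative values, not recommendations:

set hive.merge.mapredfiles=true;             -- merge small files when the map-reduce job finishes
set hive.merge.smallfiles.avgsize=16777216;  -- if the average output file is under ~16 MB, launch an extra merge job
set hive.merge.size.per.task=268435456;      -- target ~256 MB per merged file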

4. How many reduce tasks is an appropriate setting?
    Make each reduce task process a suitable amount of data (on the order of hive.exec.reducers.bytes.per.reducer, e.g. its 256 MB default).
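In practice this usually means tuning the per-reducer data volume and letting Hive derive the count, rather than hard-coding it. A minimal sketch with illustrative numbers:

set mapreduce.job.reduces=-1;                        -- do not hard-code the count
set hive.exec.reducers.bytes.per.reducer=268435456;  -- ~256 MB per reducer; a 1 GB input then gets min(1009, ceil(1 GB / 256 MB)) = 4 reducers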