前提:上个文章记录了我流量表的开发过程,成型后每个分区会有4000文件,不用hive分发+rand()函数会有6万个细碎文件。虽然已经大量减少了细碎文件的产生,但是每天产生4000个,月报4万个文件对于hive起MR过程还是造成不少压力,甚至直接起不来任务。于是做了第二次流量表的优化

优化过程思路:

  1、我先用了DISTRIBUTE BY collectday(动态分区字段) 代替了 DISTRIBUTE BY rand()(hive随机函数)字段。这种方式下每天只产生100-200文件等于每天数据拆开后的动态分区数。但是这种方式下会产生两个比较严重的问题。(1)数据倾斜,导致个别分区数据多的会聚合处理数据太久,分区数据达到压缩后50G(解压大概400Gtxt)。(2)数据量太大长时间的job会导致莫名其妙的问题,比如个别节点下线,IO问题出错。一旦出错就会导致整个JOB执行失败。

       2、于是先跑了几天的数据进行分区。确定了分区下个别数据量比较大的分区。单独提取出这些分区用DISTRIBUTE BY rand()进行处理,而小的分区用DISTRIBUTE BY collectday()进行聚合,避免细碎文件过多。这样一个大文件不至于跑太小,小文件避免了拆分开太多的数据块。

  3、扩展一些思路。这样也就会起了两个JOB。分别去处理大数据和小数据。如果追求时间更快,其实可以sql分开写,这样两个JOB会并行跑,但是会引发一些问题,可能出现一个成功一个失败,这样数据也会不完整,不好发现是否跑完。解决方式可以任务完成给数据库写sql。

 

细碎文件过多在做的过程也大概有三种解决方案:

1.自己写程序通过hadoop api进行聚合

2.通过hive写sql然后配置参数进行聚合

3.通过hive 归档(这个网上说过,自己测勒几次没成功,想想还是通过hive解决最好少了一个步骤也就没深究)

 

拆解后的sql和shell:


##task day
#day=`date +%Y%m%d -d '-1 day'`

#day="20190619"
##log url
logPath="/home/richdm/offline_cal/logs/day_logs/$day"
##start load data
#record_log ${day} 1 1 ##runing
indexpaths=""
#当天时间
day=$1
#三个月前的时间
predata=`date -d "$day -3 month" "+%Y%m%d"`
#由于流量表临近7天数据量比较大,所以把近7天的数据进行拆分reduce处理,避免严重的数据倾斜,否者聚合出来一些reduce会超过300G的数据现在暂定事近7天,次参数可以动态调整
redceday=10
#splitday=`date +%Y%m%d -d -"$redceday"days`
splitday=`date -d "$day - $redceday"days +%Y%m%d`
echo "spiltdata"$splitday
echo 'spiltday'$splitday
echo 'today'$day
echo 'predata'$predata

echo ""
kinit -k -t /hadoop1/key/richdm.keytab richdm@HADOOP.COM
shopt -s expand_aliases
alias hive="beeline -u \"jdbc:hive2://10.217.109.56:2181,10.217.109.57:2181,10.217.109.58:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2\" "
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; 
SET hive.exec.compress.output=true; 
SET mapreduce.output.fileoutputformat.compress=true; 
SET mapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzopCodec; 
SET mapreduce.job.reduce.slowstart.completedmaps=1;
SET hive.merge.mapredfiles=true;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=1024000000;
SET hive.merge.smallfiles.avgsize=512000000;
SET hive.exec.reducers.bytes.per.reducer=512000000;
SET mapreduce.map.output.compress=true;
SET mapreduce.input.fileinputformat.split.maxsize=1024000000;
SET mapreduce.map.memory.mb=6096;
SET mapreduce.reduce.memory.mb=10240;
SET mapreduce.reduce.shuffle.input.buffer.percent=0.1;
SET hive.map.aggr=true;
SET hive.groupby.skewindata=true;
SET hive.skewjoin.key=100000;
insert into table terminal.exter_terminal_flowcollect_collectday partition (day='$day',collect_day)
select
sdkversion,
imei1,
imei2,
brand,
model,
version,
masterstatus,
sendtime,
createtime,
appkey,
ip,
iccid1,
iccid2,
imsi1,
imsi2,
lac1,
cellid1,
lac2,
cellid2,
collectday,
dayhour,
nettype1,
nettype2,
totalfloww,
totalflowd,
totalflowd2,
pubnetip,
mac,
collectday as collect_day from terminal.exter_terminal_flowcollect where day = '$day' and  collectday < '$splitday' and collectday >= '$predata'  and rlike (collectday,'[0x30,0x39]') DISTRIBUTE BY collectday;
insert into table terminal.exter_terminal_flowcollect_collectday partition (day='$day',collect_day)
select
sdkversion,
imei1,
imei2,
brand,
model,
version,
masterstatus,
sendtime,
createtime,
appkey,
ip,
iccid1,
iccid2,
imsi1,
imsi2,
lac1,
cellid1,
lac2,
cellid2,
collectday,
dayhour,
nettype1,
nettype2,
totalfloww,
totalflowd,
totalflowd2,
pubnetip,
mac,
collectday as collect_day from terminal.exter_terminal_flowcollect where day = '$day' and  collectday <= '$day' and collectday >= '$splitday'  and rlike (collectday,'[0x30,0x39]') DISTRIBUTE BY rand();"

#add index on the files
hadoop fs -ls hdfs://nameservice1/user/richdm/hive/DB/terminal.db/exter_terminal_flowcollect_collectday/"day="$day/  | awk '{print $8}' > files0.txt
kinit -k -t /hadoop1/key/richdm.keytab richdm@HADOOP.COM
while read line
    do
        indexpaths=$indexpaths" "$line
   done < ./files0.txt
echo $indexpaths | tr -d '\r'
hadoop jar /hadoop1/cloudera-manager/parcel-repo/GPLEXTRAS-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.10.0.jar com.hadoop.compression.lzo.DistributedLzoIndexer $indexpaths