1. Common compression formats
Compression ratio: Snappy < LZ4 < LZO < GZIP < BZIP2
Compression/decompression speed: Snappy > LZ4 > LZO > GZIP > BZIP2
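The trade-off between ratio and speed can be checked on your own data. The sketch below is a minimal illustration using only the Python standard library, so it covers DEFLATE (zlib, the algorithm behind gzip) and bzip2 but not Snappy, LZ4 or LZO. The function names and the sample input are assumptions made for this example, and the numbers it prints depend entirely on the data and the machine.

```python
import bz2
import time
import zlib


def measure(name, compress, decompress, data):
    """Return (name, compressed/original size ratio, compression seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the input exactly.
    assert decompress(packed) == data
    return name, len(packed) / len(data), elapsed


def compare(data):
    """Measure the two codecs that ship with the Python standard library."""
    return [
        measure("zlib (DEFLATE, as used by gzip)", zlib.compress, zlib.decompress, data),
        measure("bzip2", bz2.compress, bz2.decompress, data),
    ]


if __name__ == "__main__":
    # Hypothetical sample input: repetitive, log-like text.
    sample = b"".join(
        b"2018-08-13 host%03d status=OK bytes=%d\n" % (i % 50, i * 37)
        for i in range(50000)
    )
    for name, ratio, seconds in compare(sample):
        print(f"{name}: {ratio:.1%} of original size, compressed in {seconds:.3f}s")
```

Running it on a representative sample of the real input is usually more informative than relying on a generic ranking.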
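1.1 LZO compression
Pros: fairly fast compression and decompression with a reasonable compression ratio (< 50%); supports splitting (once an index is built) and is the most popular compression format in Hadoop; supported by the Hadoop native library; the lzop command can be installed on Linux, which makes it convenient to use.
1.2 gzip compression
Pros: a fairly high compression ratio (above 30%) with fairly fast compression and decompression; supported by Hadoop itself, so an application handles gzip files exactly as it handles plain text; supported by the Hadoop native library; most Linux systems ship with the gzip command, which makes it convenient to use.
1.3 bzip2 compression
1.4 Snappy compression
Pros: high compression speed and a reasonable compression ratio (about 50%); supported by the Hadoop native library.
Format | Splittable | Native | Compression ratio | Speed | Bundled with Hadoop | Linux command | Application changes needed after switching |
gzip | No | Yes | Very high | Fairly fast | Yes, works out of the box | Yes | None, handled like plain text |
lzo | Yes | Yes | Fairly high | Very fast | No, must be installed | Yes | Requires building an index and specifying the input format |
snappy | No | Yes | Fairly high | Very fast | No, must be installed | No | None, handled like plain text |
bzip2 | Yes | No | Highest | Slow | Yes, works out of the box | Yes | None, handled like plain text |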
2. Compression in Hadoop
- Hadoop jobs are usually IO bound, compressing data can speed up the IO operations;
- Compression reduces the size of data transferred across network;
- Overall job performance may be improved simply by enabling compression;
- Splittability must be taken into account;
2.1 Splittable compression formats
Compression format | Tool | Algorithm | File extension | Splittable |
gzip | gzip | DEFLATE | .gz | No |
bzip2 | bzip2 | bzip2 | .bz2 | Yes |
LZO | lzop | LZO | .lzo | Yes if indexed |
Snappy | N/A | Snappy | .snappy | No |
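- Number of map tasks for 1 GB of uncompressed data: 1024 MB / 128 MB = 8
- Number of map tasks for 1 GB of gzip-compressed data: 1 (gzip is not splittable, so a single task has to process all of the data)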
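The arithmetic above can be written as a small helper. This is a simplified sketch: `map_task_count` is a name made up for this example, the 128 MB split size is taken from the figures above, and the small tolerance that FileInputFormat applies to the last split is ignored.

```python
import math


def map_task_count(file_size_mb, split_size_mb=128, splittable=True):
    """Estimate the number of map tasks for a single input file."""
    if file_size_mb <= 0:
        return 0
    if not splittable:
        # A non-splittable file (for example gzip) is read by one task.
        return 1
    return math.ceil(file_size_mb / split_size_mb)


if __name__ == "__main__":
    print(map_task_count(1024))                    # uncompressed, 1 GB
    print(map_task_count(1024, splittable=False))  # gzip, 1 GB
```

With a 1 GB input this gives 8 tasks for the splittable case and 1 task for the gzip case, which is why a single large gzip file limits parallelism.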
[hadoop@hadoop01 ~]$ hadoop checknative
18/08/13 18:43:08 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/08/13 18:43:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
snappy: true /lib64/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
[hadoop@hadoop01 ~]$
2.2 Common codecs
- Zlib: org.apache.hadoop.io.compress.DefaultCodec
- Gzip: org.apache.hadoop.io.compress.GzipCodec
- Bzip2: org.apache.hadoop.io.compress.BZip2Codec
- Lzo: com.hadoop.compression.lzo.LzoCodec
- Lz4: org.apache.hadoop.io.compress.Lz4Codec
- Snappy: org.apache.hadoop.io.compress.SnappyCodec
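When reading a file, Hadoop picks the codec from the file name suffix. The sketch below mimics that lookup in plain Python for the codecs bundled with Hadoop; it is an illustration only, not Hadoop code. `CODEC_BY_EXTENSION` and `codec_for_path` are names invented for this example, and LZO is left out because it comes from a separately installed library.

```python
import os

# Default file suffix of each bundled codec, mapped to its class name.
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def codec_for_path(path):
    """Return the codec class name for a path, or None for uncompressed files."""
    _, extension = os.path.splitext(path)
    return CODEC_BY_EXTENSION.get(extension.lower())


if __name__ == "__main__":
    print(codec_for_path("/tmp/compression-out/part-r-00000.bz2"))
    print(codec_for_path("/tmp/input.txt"))
```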
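2.3 Compression in MapReduce
- Map: use a splittable compression format for the input (bzip2/lzo);
- Shuffle & Sort: use a fast codec for the intermediate data;
- Reduce: use the codec with the highest compression ratio to save disk space (if the output of this job is the input of the next job, choose a splittable format).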
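The per-stage choices above map onto four MapReduce properties. The following sketch builds them and renders them in the site-file XML layout. It is one possible configuration under stated assumptions, not the configuration used in this article: Snappy for intermediate data and bzip2 for the final output are example choices, and the helper names are invented here.

```python
from xml.sax.saxutils import escape

# Example choices (assumptions): a fast codec for shuffle data and a
# dense, splittable codec for the job output.
FAST_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"
DENSE_SPLITTABLE_CODEC = "org.apache.hadoop.io.compress.BZip2Codec"


def compression_properties(intermediate_codec=FAST_CODEC,
                           output_codec=DENSE_SPLITTABLE_CODEC):
    """Return the MapReduce properties that enable compression per stage."""
    return {
        # Map output (shuffle & sort): favour speed.
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compress.codec": intermediate_codec,
        # Final job output: favour ratio and splittability.
        "mapreduce.output.fileoutputformat.compress": "true",
        "mapreduce.output.fileoutputformat.compress.codec": output_codec,
    }


def to_site_xml(properties):
    """Render properties as a Hadoop-style <configuration> document."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_site_xml(compression_properties()))
```

The same properties can also be passed per job instead of cluster-wide, which avoids restarting services while experimenting.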
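2.4 Compression in practice
2.4.1 Compression in Hadoop
Modify the Hadoop configuration files.
After modifying the configuration files, restart the HDFS and YARN services.
Run a MapReduce job to test: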
[hadoop@hadoop01 ~]$ cd app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/
[hadoop@hadoop01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /tmp/input.txt /tmp/compression-out/
[hadoop@hadoop01 mapreduce]$
[hadoop@hadoop01 mapreduce]$ hdfs dfs -ls /tmp/compression-out/
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-08-13 20:01 /tmp/compression-out/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 65 2018-08-13 20:01 /tmp/compression-out/part-r-00000.bz2
[hadoop@hadoop01 mapreduce]$ hdfs dfs -text /tmp/compression-out/part-r-00000.bz2
18/08/13 20:02:53 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/08/13 20:02:53 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
data 1
is 2
sample 1
test 2
this 2
[hadoop@hadoop01 mapreduce]$
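2.4.2 Compression in Hive
Hive's CREATE TABLE statement accepts a STORED AS file_format clause that sets the storage format of the table. Choosing a suitable format not only saves storage space in Hive but can also improve query performance.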
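The test tables in this section all share the same 33 string columns and differ only in the STORED AS clause, so the DDL can be generated rather than typed. The helper below is a convenience sketch; `create_table_ddl` and its parameters are assumptions made for this example, and it only builds the statement text without running it.

```python
def create_table_ddl(table, column_count=33, stored_as=None,
                     delimiter="||", tblproperties=None):
    """Build a CREATE TABLE statement with columns c1..cN of type string."""
    columns = ",\n".join(f"  c{i} string" for i in range(1, column_count + 1))
    ddl = (
        f"create table {table}(\n{columns}\n)\n"
        f"row format delimited fields terminated by '{delimiter}'"
    )
    if stored_as:
        # Omitting STORED AS leaves the table in Hive's default text format.
        ddl += f"\nstored as {stored_as}"
    if tblproperties:
        pairs = ", ".join(f'"{key}"="{value}"' for key, value in tblproperties.items())
        ddl += f" tblproperties ({pairs})"
    return ddl + ";"


if __name__ == "__main__":
    print(create_table_ddl("test1_orc", stored_as="orc"))
    print(create_table_ddl("test1_orc_null", stored_as="orc",
                           tblproperties={"orc.compress": "NONE"}))
```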
file_format reference:
Uncompressed
hive> create table test1(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||';
Time taken: 0.716 seconds
hive> load data local inpath '/home/hadoop/data/20180813000203.txt' overwrite into table test1;
hive> select count(1) from test1;
Time taken: 20.67 seconds, Fetched: 1 row(s)
[hadoop@hadoop01 data]$ hdfs dfs -du -s -h /user/hive/warehouse/test1
37.4 M 37.4 M /user/hive/warehouse/test1
[hadoop@hadoop01 data]$
bzip2 compression
hive> SET hive.exec.compress.output;
hive> SET mapreduce.output.fileoutputformat.compress.codec;
hive> SET hive.exec.compress.output=true;
hive> SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
hive> create table test1_bzip2
> row format delimited fields terminated by '||'
> as select * from test1;
[hadoop@hadoop01 data]$ hdfs dfs -du -s -h /user/hive/warehouse/test1_bzip2
450.0 K 450.0 K /user/hive/warehouse/test1_bzip2
[hadoop@hadoop01 data]$ hdfs dfs -ls /user/hive/warehouse/test1_bzip2
Found 1 items
-rwxr-xr-x 1 hadoop supergroup 460749 2018-08-13 20:32 /user/hive/warehouse/test1_bzip2/000000_0.bz2
hive> SET hive.exec.compress.output=false;
SEQUENCEFILE storage
hive> create table test1(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||';
Time taken: 0.145 seconds
hive> load data local inpath '/home/hadoop/data/20180813000203.txt' overwrite into table test1;
Loading data to table default.test1
Table default.test1 stats: [numFiles=1, numRows=0, totalSize=39187874, rawDataSize=0]
Time taken: 1.316 seconds
hive> create table test1_seq(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as SEQUENCEFILE;
Time taken: 0.142 seconds
hive> insert into table test1_seq select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0002, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:31:06,612 Stage-1 map = 0%, reduce = 0%
2018-08-14 03:31:14,123 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.86 sec
MapReduce Total cumulative CPU time: 1 seconds 860 msec
Ended Job = job_1534181281204_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://
Loading data to table default.test1_seq
Table default.test1_seq stats: [numFiles=1, numRows=76241, totalSize=12155988, rawDataSize=10972575]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.86 sec HDFS Read: 39194480 HDFS Write: 12156070 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 860 msec
Time taken: 17.464 seconds
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_seq;
11.6 M 11.6 M /user/hive/warehouse/test1_seq
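hive>
RCFILE storage
RCFile is a hybrid row-columnar format: rows are divided into row groups, and within each row group the data is stored column by column, so all columns of a given row stay in the same block. Its drawback is that the row groups are too small, so it is rarely used nowadays.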
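The hybrid layout can be pictured with a toy model: rows are first cut into row groups, then each group is transposed so that values of the same column sit together. This is a conceptual sketch only and does not reproduce the actual RCFile on-disk format; the function names are invented for the example.

```python
def to_row_groups(rows, rows_per_group):
    """Split rows into row groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # zip(*chunk) transposes the chunk: one tuple per column.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def read_column(groups, column_index):
    """Read a single column across all row groups, skipping the others."""
    values = []
    for group in groups:
        values.extend(group[column_index])
    return values


if __name__ == "__main__":
    rows = [("a", 1, "x"), ("b", 2, "y"), ("c", 3, "z")]
    groups = to_row_groups(rows, rows_per_group=2)
    print(groups)
    print(read_column(groups, 1))
```

A small `rows_per_group` means each column run is short, which limits both compression and the benefit of skipping columns; that is the drawback mentioned above.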
hive> create table test1_rcfile(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as rcfile;
Time taken: 0.146 seconds
hive> insert into test1_rcfile select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0003, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0003/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:46:15,737 Stage-1 map = 0%, reduce = 0%
2018-08-14 03:46:23,228 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
MapReduce Total cumulative CPU time: 1 seconds 870 msec
Ended Job = job_1534181281204_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://
Loading data to table default.test1_rcfile
Table default.test1_rcfile stats: [numFiles=1, numRows=76241, totalSize=8604253, rawDataSize=8532863]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.87 sec HDFS Read: 39194682 HDFS Write: 8604337 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 870 msec
Time taken: 17.323 seconds
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_rcfile;
8.2 M 8.2 M /user/hive/warehouse/test1_rcfile
hive>
ORC storage
hive> create table test1_orc(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as orc;
Time taken: 0.116 seconds
hive> insert into test1_orc select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0004, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 03:53:54,382 Stage-1 map = 0%, reduce = 0%
2018-08-14 03:54:04,091 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.38 sec
MapReduce Total cumulative CPU time: 4 seconds 380 msec
Ended Job = job_1534181281204_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://
Loading data to table default.test1_orc
Table default.test1_orc stats: [numFiles=1, numRows=76241, totalSize=678984, rawDataSize=219269116]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.38 sec HDFS Read: 39194660 HDFS Write: 679067 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 380 msec
Time taken: 20.572 seconds
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc;
663.1 K 663.1 K /user/hive/warehouse/test1_orc
hive> create table test1_orc_null(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as orc tblproperties ("orc.compress"="NONE");
Time taken: 0.156 seconds
hive> insert into test1_orc_null select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0005, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:02:40,232 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:02:49,821 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.48 sec
MapReduce Total cumulative CPU time: 3 seconds 480 msec
Ended Job = job_1534181281204_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://
Loading data to table default.test1_orc_null
Table default.test1_orc_null stats: [numFiles=1, numRows=76241, totalSize=2087911, rawDataSize=219269116]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.48 sec HDFS Read: 39194703 HDFS Write: 2087999 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 480 msec
Time taken: 19.238 seconds
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc;
663.1 K 663.1 K /user/hive/warehouse/test1_orc
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_orc_null;
2.0 M 2.0 M /user/hive/warehouse/test1_orc_null
hive>
PARQUET storage
hive> create table test1_parquet(
> c1 string,
> c2 string,
> c3 string,
> c4 string,
> c5 string,
> c6 string,
> c7 string,
> c8 string,
> c9 string,
> c10 string,
> c11 string,
> c12 string,
> c13 string,
> c14 string,
> c15 string,
> c16 string,
> c17 string,
> c18 string,
> c19 string,
> c20 string,
> c21 string,
> c22 string,
> c23 string,
> c24 string,
> c25 string,
> c26 string,
> c27 string,
> c28 string,
> c29 string,
> c30 string,
> c31 string,
> c32 string,
> c33 string
> )
> row format delimited fields terminated by '||'
> stored as parquet;
Time taken: 0.137 seconds
hive> insert into test1_parquet select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0006, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0006/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:09:14,081 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:09:23,547 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.47 sec
MapReduce Total cumulative CPU time: 3 seconds 470 msec
Ended Job = job_1534181281204_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://
Loading data to table default.test1_parquet
Table default.test1_parquet stats: [numFiles=1, numRows=76241, totalSize=2239503, rawDataSize=2515953]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.47 sec HDFS Read: 39194597 HDFS Write: 2239588 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 470 msec
Time taken: 19.319 seconds
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet;
2.1 M 2.1 M /user/hive/warehouse/test1_parquet
hive> set parquet.compression=GZIP;
hive> create table test1_parquet_gzip
> row format delimited fields terminated by '||'
> stored as parquet
> as select * from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1534181281204_0007, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0007/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-08-14 04:13:34,200 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:13:43,790 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_1534181281204_0007
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://
Moving data to: hdfs://
Table default.test1_parquet_gzip stats: [numFiles=1, numRows=76241, totalSize=605285, rawDataSize=2515953]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.85 sec HDFS Read: 39193754 HDFS Write: 605375 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
Time taken: 20.185 seconds
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1;
37.4 M 37.4 M /user/hive/warehouse/test1
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet;
2.1 M 2.1 M /user/hive/warehouse/test1_parquet
hive> !hdfs dfs -du -s -h /user/hive/warehouse/test1_parquet_gzip;
591.1 K 591.1 K /user/hive/warehouse/test1_parquet_gzip
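To compare the results side by side, the sizes reported in the sessions above (the totalSize values in the table stats, plus the file size listed for test1_bzip2) can be put into one small script. The byte counts are copied from the output above; the labels and helper names are assumptions for this sketch.

```python
# Sizes in bytes, as reported by Hive / HDFS in the sessions above.
SIZES_BYTES = {
    "test1 (text, uncompressed)": 39187874,
    "test1_bzip2 (text + bzip2)": 460749,
    "test1_seq (SequenceFile)": 12155988,
    "test1_rcfile (RCFile)": 8604253,
    "test1_orc (ORC, default settings)": 678984,
    "test1_orc_null (ORC, orc.compress=NONE)": 2087911,
    "test1_parquet (Parquet)": 2239503,
    "test1_parquet_gzip (Parquet + GZIP)": 605285,
}

BASELINE = "test1 (text, uncompressed)"


def relative_sizes(sizes, baseline_key):
    """Return each table's size as a fraction of the baseline table."""
    baseline = sizes[baseline_key]
    return {name: size / baseline for name, size in sizes.items()}


def report(sizes, baseline_key):
    """Format one line per table, smallest first."""
    ratios = relative_sizes(sizes, baseline_key)
    lines = []
    for name in sorted(sizes, key=sizes.get):
        lines.append(f"{name:<42} {sizes[name]:>10} bytes  {ratios[name]:7.2%}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(SIZES_BYTES, BASELINE))
```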
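Beyond storage size, the storage formats and compression codecs compared above also differ in how much data a query has to read (see HDFS Read in the job summaries).
- A full-table count on the original table (76241 rows) reads 39196646 bytes
- test1 is row-oriented; the filtered query reads 39197330 bytes
- test1_orc is column-oriented; the query reads 296842 bytes (only the c1 column is read)
- test1_parquet is also column-oriented; the query reads 1249432 bytes (only the c1 column is read, so it performs far better than the original table, though not as well as ORC)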
hive> select count(1) from test1;
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1534181281204_0011, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0011/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:24:12,595 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:24:18,976 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.3 sec
2018-08-14 04:24:27,315 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1534181281204_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.68 sec HDFS Read: 39196646 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
Time taken: 23.739 seconds, Fetched: 1 row(s)
hive> select count(1) from test1 where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1534181281204_0008, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0008/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:19:36,821 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:19:44,152 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.66 sec
2018-08-14 04:19:51,497 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.94 sec
MapReduce Total cumulative CPU time: 2 seconds 940 msec
Ended Job = job_1534181281204_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.94 sec HDFS Read: 39197330 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 940 msec
Time taken: 23.09 seconds, Fetched: 1 row(s)
hive> select count(1) from test1_rc where c1 = 'WS-C4506-E';
FAILED: SemanticException [Error 10001]: Line 1:21 Table not found 'test1_rc'
hive> select count(1) from test1_orc where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1534181281204_0009, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0009/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:20:22,163 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:20:29,589 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.95 sec
2018-08-14 04:20:36,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.29 sec
MapReduce Total cumulative CPU time: 3 seconds 290 msec
Ended Job = job_1534181281204_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.29 sec HDFS Read: 296842 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 290 msec
Time taken: 22.92 seconds, Fetched: 1 row(s)
hive> select count(1) from test1_parquet where c1 = 'WS-C4506-E';
Query ID = hadoop_20180814012828_30c31c75-9676-42cb-8198-19f4c9c364d6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1534181281204_0010, Tracking URL = http://hadoop000:8088/proxy/application_1534181281204_0010/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1534181281204_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-08-14 04:20:53,130 Stage-1 map = 0%, reduce = 0%
2018-08-14 04:21:00,451 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.11 sec
2018-08-14 04:21:07,737 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.53 sec
MapReduce Total cumulative CPU time: 3 seconds 530 msec
Ended Job = job_1534181281204_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.53 sec HDFS Read: 1249432 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 530 msec
Time taken: 22.837 seconds, Fetched: 1 row(s)
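The HDFS Read figures can be pulled out of the job summary lines automatically. The sketch below parses lines of the form shown above and reports how much each query reads relative to the text table; the function names are assumptions for this example, and the pattern is only matched against the summary format seen in these logs.

```python
import re

# Matches the tail of a "Stage-Stage-N: ..." job summary line.
SUMMARY_PATTERN = re.compile(r"HDFS Read:\s*(\d+)\s+HDFS Write:\s*(\d+)\s+(\w+)")


def parse_job_summary(line):
    """Extract read/write byte counts and status, or None if absent."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        return None
    return {
        "hdfs_read": int(match.group(1)),
        "hdfs_write": int(match.group(2)),
        "status": match.group(3),
    }


def read_fraction(line, baseline_line):
    """Bytes read by one query as a fraction of the baseline query."""
    return parse_job_summary(line)["hdfs_read"] / parse_job_summary(baseline_line)["hdfs_read"]


if __name__ == "__main__":
    text = "Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.94 sec HDFS Read: 39197330 HDFS Write: 2 SUCCESS"
    orc = "Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.29 sec HDFS Read: 296842 HDFS Write: 2 SUCCESS"
    parquet = "Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.53 sec HDFS Read: 1249432 HDFS Write: 2 SUCCESS"
    for name, line in (("orc", orc), ("parquet", parquet)):
        print(f"{name}: {read_fraction(line, text):.2%} of the bytes read from the text table")
```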