We often use Hive to produce data report statistics, and in certain scenarios we need to export the result of a Hive query. Several export methods have grown out of this need, and this post summarizes them:
1. Run the query through beeline and redirect its output to a file
2. Use INSERT OVERWRITE to write the query result to HDFS or to the local filesystem
First, here is the main query whose result we want to export:
use default;
SELECT media_info.media_name, media_tmp.* , activity_info.*, camp_media_tmp.*
FROM
(
SELECT
activity.id,
activity.activity_name,
advertiser.advertiser_name,
agency.agency_name
FROM promotion_activity_mysql AS activity
JOIN admin_advertiser_mysql AS advertiser ON activity.advertiser_id = advertiser.id
JOIN admin_agency_mysql AS agency ON advertiser.agency_id = agency.id
) AS activity_info
JOIN
(
SELECT
mid.mediaid ,
mid.campaignid ,
SUM(mid.campimp) as camp_imp ,
SUM(mid.campclick) as camp_click ,
SUM(mid.campclickcost) as click_cost
FROM
clickcube_mid AS mid
WHERE mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
GROUP BY mid.mediaid, mid.campaignid
) AS camp_media_tmp ON activity_info.id = camp_media_tmp.campaignid
JOIN
(
SELECT
'2018-06' ,
mid.mediaid ,
SUM(mid.mediaimp) AS media_imp_total ,
SUM(mid.mediaclick) AS media_clk_total ,
SUM(mid.mediaclickcost) AS media_cost
FROM
clickcube_mid AS mid
WHERE mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
GROUP BY mid.mediaid
) AS media_tmp ON camp_media_tmp.mediaid = media_tmp.mediaid
JOIN
admin_media_mysql AS media_info ON media_info.id = media_tmp.mediaid
ORDER BY media_tmp.mediaid, camp_media_tmp.campaignid
;
1. Run the query through beeline and redirect its output to a file
Drawback: exporting this way is essentially the same as printing the query result to the console, so the file contains a lot of redundant information (connection prompts, echoed statements, and so on); a sketch of beeline options that trim this noise follows the sample output below.
Before that, it helps to be familiar with beeline's command-line options. See the companion post: Hive / Hive2 新版连接工具 beeline 详解 (a detailed walkthrough of the beeline client).
One thing to note: Hive queries do not print column headers by default, so we need to turn the header switch on. We put this setting into a *.init file that beeline loads with -i:
final_to_hdfs_201806.init
SET hive.cli.print.header=true;
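As an aside (my assumption, not part of the original workflow): hive.cli.print.header is historically a Hive CLI property, and when connecting through beeline the header display can also be controlled directly with beeline's own --showHeader option, roughly like this:
# sketch: ask beeline itself to print the column headers
beeline -u jdbc:hive2://10.180.0.26:10000 -n cloudera-scm --showHeader=true -f final_to_file_201806_report.hql > final_report_file.txt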
Now we can redirect the output to a file. We wrap the beeline invocation in a shell script:
final_to_file.sh
#!/bin/bash
beeline -u jdbc:hive2://10.180.0.26:10000 -n cloudera-scm -f final_to_file_201806_report.hql -i final_to_hdfs_201806.init > final_report_file.txt
The resulting file looks like this:
[cloudera-scm@dmp-job002 20180703_media_agency_job]$ head -n 10 final_report_file.txt
0: jdbc:hive2://10.180.0.26:10000>
0: jdbc:hive2://10.180.0.26:10000> SET hive.cli.print.header=true;
0: jdbc:hive2://10.180.0.26:10000>
0: jdbc:hive2://10.180.0.26:10000>
0: jdbc:hive2://10.180.0.26:10000>
0: jdbc:hive2://10.180.0.26:10000> use default;
0: jdbc:hive2://10.180.0.26:10000>
0: jdbc:hive2://10.180.0.26:10000> SELECT media_info.media_name, media_tmp.* , activity_info.*, camp_media_tmp.*
. . . . . . . . . . . . . . . . .> FROM
. . . . . . . . . . . . . . . . .> (
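As the sample shows, every prompt and echoed statement ends up in the file. A rough sketch of how this noise can be reduced with standard beeline options (--silent, --showHeader, --outputformat; the exact combination is my assumption, not taken from the original script, and behavior can vary between Hive versions):
# sketch: suppress the connection/prompt chatter and emit CSV with a header row
beeline -u jdbc:hive2://10.180.0.26:10000 -n cloudera-scm \
        --silent=true --showHeader=true --outputformat=csv2 \
        -f final_to_file_201806_report.hql > final_report_file.csv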
2. Use INSERT OVERWRITE to export the query result to HDFS or to the local filesystem
Drawback: this method writes no header row, so the header has to be added by hand afterwards (a sketch of doing that is given at the end of this post).
First, the syntax of Hive's export statement, INSERT OVERWRITE ... DIRECTORY:
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
As you can see, this form of export is quite flexible.
First, the output file format can be chosen:
[STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
At the time of writing, the selectable file formats are mainly the following:
file_format:
  : SEQUENCEFILE
    (carries metadata about the data, splittable, row-oriented)
  | TEXTFILE
    (plain text, the most basic format)
  | RCFILE (Note: Only available starting with Hive 0.6.0)
    (column-oriented)
  | ORC (Note: Only available starting with Hive 0.11.0)
  | AVRO (Note: Only available starting with Hive 0.14.0)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
  | PARQUET
    (column-oriented)
Here we store the result as TEXTFILE.
Example:
STORED AS TEXTFILE
Second, the field delimiter between columns can be set explicitly. Hive's default field delimiter is ^A (ASCII code 1, written as '\001'), which is inconvenient to view and edit, so we change it to a comma. (Keep in mind that if a field value itself contains a comma, the exported file becomes ambiguous, so pick a delimiter that does not occur in the data.)
Example:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
Finally, the location of the output files can be specified: with the LOCAL keyword they go to the local disk of the machine executing the query, and without it they go to HDFS. Here we write to HDFS.
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
Example:
INSERT OVERWRITE directory '/tmp/201806_report'
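For completeness, a minimal sketch of the LOCAL variant run through beeline -e (the directory /tmp/201806_report_local and the simplified SELECT are made up for illustration):
# sketch: export to the local filesystem of the executing machine instead of HDFS
beeline -u jdbc:hive2://10.180.0.26:10000 -n cloudera-scm -e "
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/201806_report_local'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT mid.mediaid, SUM(mid.mediaimp) AS media_imp FROM clickcube_mid AS mid
WHERE mid.day = '2018-06-01' GROUP BY mid.mediaid;"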
The final export statement:
use default;
INSERT OVERWRITE directory '/tmp/201806_report'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- STORED AS SEQUENCEFILE
SELECT media_info.media_name, media_tmp.* , activity_info.*, camp_media_tmp.*
FROM
(
SELECT
activity.id,
activity.activity_name,
advertiser.advertiser_name,
agency.agency_name
FROM promotion_activity_mysql AS activity
JOIN admin_advertiser_mysql AS advertiser ON activity.advertiser_id = advertiser.id
JOIN admin_agency_mysql AS agency ON advertiser.agency_id = agency.id
) AS activity_info
JOIN
(
SELECT
mid.mediaid ,
mid.campaignid ,
SUM(mid.campimp) as camp_imp ,
SUM(mid.campclick) as camp_click ,
SUM(mid.campclickcost) as click_cost
FROM
clickcube_mid AS mid
WHERE mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
GROUP BY mid.mediaid, mid.campaignid
) AS camp_media_tmp ON activity_info.id = camp_media_tmp.campaignid
JOIN
(
SELECT
'2018-06' ,
mid.mediaid ,
SUM(mid.mediaimp) AS media_imp_total ,
SUM(mid.mediaclick) AS media_clk_total ,
SUM(mid.mediaclickcost) AS media_cost
FROM
clickcube_mid AS mid
WHERE mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
GROUP BY mid.mediaid
) AS media_tmp ON camp_media_tmp.mediaid = media_tmp.mediaid
JOIN
admin_media_mysql AS media_info ON media_info.id = media_tmp.mediaid
ORDER BY media_tmp.mediaid, camp_media_tmp.campaignid
;
As before, we wrap the beeline invocation in a shell script:
#!/bin/bash
beeline -u jdbc:hive2://10.180.0.26:10000 -n cloudera-scm -f final_to_hdfs_201806_report.hql -i final_to_hdfs_201806.init
Finally, let's look at the exported result:
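The part files land under /tmp/201806_report on HDFS as comma-delimited text but, as noted above, without a header row. A minimal sketch of pulling them down and prepending a header by hand (the column names below are my reading of what the SELECT returns, so adjust them to the actual schema):
# sketch: merge the exported part files and prepend a hand-written header row
hdfs dfs -getmerge /tmp/201806_report 201806_report_body.csv
echo 'media_name,month,mediaid,media_imp_total,media_clk_total,media_cost,id,activity_name,advertiser_name,agency_name,camp_mediaid,campaignid,camp_imp,camp_click,click_cost' > 201806_report.csv
cat 201806_report_body.csv >> 201806_report.csv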