我们经常利用Hive完成一些数据报表统计的需求,此时我们在特定场景中需要对Hive查询的结果做导出,

这样就延伸出了几种查询结果导出的方法,我们这里做一下总结。


1.利用beeline 执行查询,将输出重定向的方式

2.利用insert overwrite 的方式,将查询结果导出hdfs或者本地


第一步我们看下主体的查询语句:

use default;


SELECT media_info.media_name, media_tmp.* , activity_info.*, camp_media_tmp.*
FROM 
(
	SELECT
        	activity.id,
        	activity.activity_name,
       		advertiser.advertiser_name,
        	agency.agency_name
	FROM promotion_activity_mysql AS activity
	JOIN admin_advertiser_mysql AS advertiser ON activity.advertiser_id = advertiser.id
	JOIN admin_agency_mysql AS agency ON advertiser.agency_id = agency.id
) AS activity_info
JOIN 
(
	SELECT
        	mid.mediaid     ,
       		mid.campaignid  ,
       		SUM(mid.campimp) as camp_imp    ,
        	SUM(mid.campclick) as camp_click        ,
        	SUM(mid.campclickcost) as click_cost
	FROM
        	clickcube_mid AS  mid
	WHERE   mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
	GROUP BY  mid.mediaid, mid.campaignid
) AS camp_media_tmp ON activity_info.id = camp_media_tmp.campaignid 
JOIN
(
	SELECT
        	'2018-06'       ,
        	mid.mediaid     ,
        	SUM(mid.mediaimp) AS  media_imp_total   ,
        	SUM(mid.mediaclick) AS media_clk_total      ,
        	SUM(mid.mediaclickcost) AS media_cost
	FROM
        	clickcube_mid AS  mid
	WHERE   mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
	GROUP BY  mid.mediaid
) AS media_tmp ON  camp_media_tmp.mediaid = media_tmp.mediaid
JOIN 
admin_media_mysql AS media_info ON media_info.id = media_tmp.mediaid 
ORDER BY media_tmp.mediaid, camp_media_tmp.campaignid 
;


1.首先我们看下beeline执行查询,文件重定向的方式

缺点:该种将数据导出的方式,其实跟在控制台将查询结果输出并没有什么区别。

输出会包含很多冗余的信息


首先我们需要对beeline 的参数较为了解:

传送门:

Hive _ Hive2 新版连接工具 beeline 详解



这里我们需要注意一个问题,hive查询默认不显示表头的,我们需要打开显示表头的开关

我们将这个操作写到 **.init 文件中

final_to_hdfs_201806.init

SET hive.cli.print.header=true;


这里我们就可以利用文件中重定向了,我们将beeline 执行的过程写到shell 脚本中 .

hive中udtf问题 hive out_Hive


final_to_file.sh

#!/bin/bash


beeline -u jdbc:hive2://10.180.0.26:10000 -n cloudera-scm -f final_to_file_201806_report.hql -i final_to_hdfs_201806.init > final_report_file.txt



看下最终执行的结果

[cloudera-scm@dmp-job002 20180703_media_agency_job]$ head -n 10 final_report_file.txt 
0: jdbc:hive2://10.180.0.26:10000> 
0: jdbc:hive2://10.180.0.26:10000> SET hive.cli.print.header=true;
0: jdbc:hive2://10.180.0.26:10000> 
0: jdbc:hive2://10.180.0.26:10000> 
0: jdbc:hive2://10.180.0.26:10000> 
0: jdbc:hive2://10.180.0.26:10000> use default;
0: jdbc:hive2://10.180.0.26:10000> 
0: jdbc:hive2://10.180.0.26:10000> SELECT media_info.media_name, media_tmp.* , activity_info.*, camp_media_tmp.*
. . . . . . . . . . . . . . . . .> FROM 
. . . . . . . . . . . . . . . . .> (



2.利用insert overwrite 的方式,将查询结果导出hdfs或者本地

缺点: 该种方式并没有数据表头的信息,需要手动增加表头



首先我们看下对Hive 数据导出语句 INSERT OVERWRITE 的介绍:

Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...
 
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
 
  
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

hive中udtf问题 hive out_hive中udtf问题_02


可以该种方式的支持比较完善

首先导出的数据格式是可选的:

[STORED AS file_format] (Note: Only available starting with Hive 0.11.0)

也就是说导出的数据格式是可选的,到目前为止可选的数据文件类型主要为以下几种文件:

file_format:  
  : SEQUENCEFILE

    含有数据的元信息,可分块,数据按行组织起来

| TEXTFILE  
    TEXTFILE 最原始的文本文件

  | RCFILE     (Note: Only available starting with Hive 0.6.0)

    数据按列组织起来

| ORC        (Note: Only available starting with Hive 0.11.0)  

  | AVRO       (Note: Only available starting with Hive 0.14.0)  

  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname  


  | PARQUET

     数据按行组织起来



这里我们存为textfile

示例:

STORED AS TEXTFILE


其次导出的字段分隔符是可以自己设定的,   

Hive 中字段之间的分隔符默认是 ^A (ascii码是\00001),

不便于修改查看,我们将分隔符改为 , (逗号)


示例:

ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','


最后查询的数据文件的位置是可以指定的,可以存放在执行查询的机器的本地磁盘上,也可以存放在hdfs上,这里我们指定hdfs.

INSERT OVERWRITE [LOCAL] DIRECTORY directory1


示例:

INSERT OVERWRITE directory '/tmp/201806_report'




最终的查询语句

use default;

INSERT OVERWRITE directory '/tmp/201806_report'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- STORED AS SEQUENCEFILE
SELECT media_info.media_name, media_tmp.* , activity_info.*, camp_media_tmp.*
FROM 
(
	SELECT
        	activity.id,
        	activity.activity_name,
       		advertiser.advertiser_name,
        	agency.agency_name
	FROM promotion_activity_mysql AS activity
	JOIN admin_advertiser_mysql AS advertiser ON activity.advertiser_id = advertiser.id
	JOIN admin_agency_mysql AS agency ON advertiser.agency_id = agency.id
) AS activity_info
JOIN 
(
	SELECT
        	mid.mediaid     ,
       		mid.campaignid  ,
       		SUM(mid.campimp) as camp_imp    ,
        	SUM(mid.campclick) as camp_click        ,
        	SUM(mid.campclickcost) as click_cost
	FROM
        	clickcube_mid AS  mid
	WHERE   mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
	GROUP BY  mid.mediaid, mid.campaignid
) AS camp_media_tmp ON activity_info.id = camp_media_tmp.campaignid 
JOIN
(
	SELECT
        	'2018-06'       ,
        	mid.mediaid     ,
        	SUM(mid.mediaimp) AS  media_imp_total   ,
        	SUM(mid.mediaclick) AS media_clk_total      ,
        	SUM(mid.mediaclickcost) AS media_cost
	FROM
        	clickcube_mid AS  mid
	WHERE   mid.day >= '2018-06-01' AND mid.day <= '2018-06-30'
	GROUP BY  mid.mediaid
) AS media_tmp ON  camp_media_tmp.mediaid = media_tmp.mediaid
JOIN 
admin_media_mysql AS media_info ON media_info.id = media_tmp.mediaid 
ORDER BY media_tmp.mediaid, camp_media_tmp.campaignid 
;


我们将执行过程写入脚本中

#!/bin/bash


beeline -u jdbc:hive2://10.180.0.26:10000 -n cloudera-scm -f final_to_hdfs_201806_report.hql -i final_to_hdfs_201806.init



最终的执行结果:

hive中udtf问题 hive out_bc_03

我们查看下执行结果:

hive中udtf问题 hive out_bc_04