1. ORDER BY 和SORT BY 语句
order by 会对输入做全局排序,因此只有一个reducer(多个reducer无法保证全局有序)
只有一个reducer,会导致当输入规模较大时,需要较长的计算时间。
sort by不是全局排序,其在数据进入reducer前完成排序.因此,如果用sort by进行排序,并且设置mapred.reduce.tasks>1, 则sort by只保证每个reducer的输出有序,不保证全局有序。
l order by : 全局排序,启动一个reducer进行处理。
l sort by :局部排序,启动多个reducer进行处理,一般业务都是获取topN的排名,一般可以使用sort by进行局部排序,在通过limit n对局部结果进行汇总
Hive提供了关键字limit,在可以获取topN,获取网站页面访问的top10的实现方案:
通过一下方案,我们可以发现order by 启动两个job,sort by 启动3个job。sort by可以启动多个reduce,每个reduce做局部排序,但是这对于sort by limit N已经够用了
方案1: 使用order by 进行排序
select * from ( select referrerpage ,count(1) refer_count from data_collect group by referrerpage ) t order by refer_count desc limit 10;
结果: t.referrerpage t.refer_count http%3A//jf.10086.cn/ 4714 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchMyExchangeWareInfo 1684 http%3A//jf.10086.cn/index.html 1171 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchWareInfo%26pager.offset%3D12 923 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchMyExchangeWareInfo%26pager.offset%3D12 794 http%3A//jf.10086.cn/ware/allClass.jsp 699 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchWareInfo%26pager.offset%3D24 685 http%3A//jf.10086.cn/portal/order/web/UserOrderAction%3Faction%3DdirectExchangeWare 680 http%3A//jf.10086.cn/rank/0_0_0_0.html 673
日志分析: hive (jfyun)> select * from ( > select referrerpage ,count(1) refer_count from data_collect group by referrerpage > ) t > order by refer_count desc limit 10; Automatically selecting local only mode for query Total jobs = 2
Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 0; number of reducers: 0 2016-01-06 21:08:49,862 null map = 0%, reduce = 0% 2016-01-06 21:08:53,345 null map = 100%, reduce = 0% 2016-01-06 21:08:54,462 null map = 100%, reduce = 100% Ended Job = job_local785531207_0001 Execution completed successfully MapredLocal task succeeded
Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 0; number of reducers: 0 2016-01-06 21:09:03,129 null map = 0%, reduce = 0% 2016-01-06 21:09:04,385 null map = 100%, reduce = 0% 2016-01-06 21:09:05,459 null map = 100%, reduce = 100% Ended Job = job_local127113388_0001 Execution completed successfully MapredLocal task succeeded OK |
方案2: 使用sort by 进行排序
结果: t.referrerpage t.refer_count http%3A//jf.10086.cn/ 4714 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchMyExchangeWareInfo 1684 http%3A//jf.10086.cn/index.html 1171 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchWareInfo%26pager.offset%3D12 923 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchMyExchangeWareInfo%26pager.offset%3D12 794 http%3A//jf.10086.cn/ware/allClass.jsp 699 http%3A//jf.10086.cn/portal/ware/web/SearchWareAction%3Faction%3DsearchWareInfo%26pager.offset%3D24 685 http%3A//jf.10086.cn/portal/order/web/UserOrderAction%3Faction%3DdirectExchangeWare 680 http%3A//jf.10086.cn/rank/0_0_0_0.html 673
日志分析:
|
2. 包含SORT BY 的DISTRIBUTE BY
distribute by按照指定的字段对数据进行划分到不同的输出reduce / 文件中,根据map函数输出的key通过hash函数计算哈希数值,然后得到的哈希数值将键-值对均匀分发到多个reduer中去。
需求:获取每个商品类别和国家下的商品个数,并按照商品类别和国家下的商品销售取前三名。 数据: 类别 国家 商品名称 销售数量 movies us movies_us_1 100 movies us movies_us_2 150 movies us movies_us_3 200 movies us movies_us_4 300 movies gb movies_gb_1 100 movies gb movies_gb_2 150 movies gb movies_gb_3 200 movies gb movies_gb_4 300 office gb office_gb_1 30 office gb office_gb_2 40 office gb office_gb_3 50 office gb office_gb_4 60 office us office_us_1 30 office us office_us_2 50 office us office_us_3 60 office us office_us_4 70 数据表 create external table p_rank_demo(category string,country string,product string,sales int) row format delimited fields terminated by '\t';
加载数据 load data local inpath '/home/hadoop/p_rank_demo' overwrite into table p_rank_demo;
按照商品类别和国家下的商品销售排名(按照category,country分组,并通过sort by每个分组的结果排序) select category,country,product, sales,rank() over(distribute by category,country sort by category,country, sales desc) as rk from p_rank_demo;
或者通过 select category,country,product, sales,rank() over(partition by category,country order by sales desc) as rk from p_rank_demo;
输出通过的结果(两者的区别: distribute by 具有相同的类别和国家的记录都发送到同一个reducer上,这样就可以统计 出每个类别和国家的销售商品的排名了。若业务是全局排序,可以通过partition by order by 获取排名。) category country product sales rk movies gb movies_gb_4 300 1 movies gb movies_gb_3 200 2 movies gb movies_gb_2 150 3 movies gb movies_gb_1 100 4 movies us movies_us_4 300 1 movies us movies_us_3 200 2 movies us movies_us_1 100 3 movies us movies_us_2 100 3 office gb office_gb_4 60 1 office gb office_gb_3 50 2 office gb office_gb_2 40 3 office gb office_gb_1 30 4 office us office_us_4 70 1 office us office_us_3 60 2 office us office_us_2 50 3 office us office_us_1 30 4
按照商品类别和国家下的商品销售排名前三输出 select * from (select category,country,product, sales,rank() over(distribute by category,country sort by category,country, sales desc) as rk from p_rank_demo) t where t.rk <= 3;
输出结果: t.category t.country t.product t.sales t.rk movies gb movies_gb_4 300 1 movies gb movies_gb_3 200 2 movies gb movies_gb_2 150 3 movies us movies_us_4 300 1 movies us movies_us_3 200 2 movies us movies_us_1 100 3 movies us movies_us_2 100 3 office gb office_gb_4 60 1 office gb office_gb_3 50 2 office gb office_gb_2 40 3 office us office_us_4 70 1 office us office_us_3 60 2 office us office_us_2 50 3 |
3. CLUSTER BY
使用distribute by A sort by A 语句同cluster by A语句相同,即:分区和排序属于一个字段。但是cluster by 排序只能是倒序排序,不能指定排序规则为asc 或者desc。