一、课程回顾
1.共同的问题
(1)换行
lines terminated by ‘\n’ -> 让Java将换行符作为两条数据分分隔符
…metastore?\ncreate…
xml的配置文件 不要有中文,不要有空格,不要有tab,不要换行 ****
(2)关于日志
hive的日志需要自己配置一下的
[hadoop@hadoop212 conf]$ vim hive-log4j.properties
2.核心知识点
(1)关于Hive
- Hive是一个Java编写的应用程序,就是没有图形化界面,只提供了一个shell的界面进行操作。
- Hive是将你编写的SQL语句(HQL),转换为MR程序并提交Hadoop集群运行。
- Hive将hdfs上的结构化的数据文件,映射成数据表,然后进行分析。
(2)关于元数据
- 默认的derby元数据,基本上不会有人使用,因为只支持1个连接。
- 远程MySQL,提供元数据存储服务。
二、Hive语句练习
1.emp数据表分析
hive> show databases;
OK
default
Time taken: 4.961 seconds, Fetched: 1 row(s)
hive> show databases;
OK
default
Time taken: 0.023 seconds, Fetched: 1 row(s)
hive> create database empManager;
OK
Time taken: 0.252 seconds
hive> show databases;
OK
default
empmanager
Time taken: 0.021 seconds, Fetched: 2 row(s)
hive> use empmanager;
OK
Time taken: 0.021 seconds
hive> create table emp(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double,
> deptno int
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.223 seconds
hive> load data local inpath '/opt/datas/emp.txt' into table emp;
Loading data to table empmanager.emp
Table empmanager.emp stats: [numFiles=1, totalSize=656]
OK
Time taken: 0.644 seconds
hive> select * from emp;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.309 seconds, Fetched: 14 row(s)
hive> create table dept(
> deptno int,
> dname string,
> loc string
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.088 seconds
hive> load data inpath '/datas/dept.txt' into table dept;
Loading data to table empmanager.dept
Table empmanager.dept stats: [numFiles=1, totalSize=79]
OK
Time taken: 0.379 seconds
hive> select * from dept;
OK
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 0.085 seconds, Fetched: 4 row(s)
从本地读取数据 load data local inpath 复制文件到hdfs上 bin/hdfs dfs -put
从集群上读取数据 load data inpath 移动文件到warehouse中 bin/hdfs dfs -mv
2.创建数据表的方式介绍
- 最普通的方式
hive> create table dept(
> deptno int,
> dname string,
> loc string
> )
> row format delimited fields terminated by '\t';
- 使用as关键字
as关键字将表结果和数据都进行复制
hive> create table emp2 as select * from emp;
Query ID = hadoop_20200725091111_26fe472d-7fb6-4abb-958d-1e857c610b91
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1595637424134_0001, Tracking URL = http://hadoop212:8088/proxy/application_159563742
4134_0001/Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job -kill job_1595637424134_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-07-25 09:11:37,321 Stage-1 map = 0%, reduce = 0%
2020-07-25 09:11:48,452 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.08 sec
MapReduce Total cumulative CPU time: 4 seconds 80 msec
Ended Job = job_1595637424134_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop212:8020/user/hive/warehouse/empmanager.db/.hive-staging_hive_2020-07-25_0
9-11-17_291_3630837031185948323-1/-ext-10001Moving data to: hdfs://hadoop212:8020/user/hive/warehouse/empmanager.db/emp2
Table empmanager.emp2 stats: [numFiles=1, numRows=14, totalSize=661, rawDataSize=647]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.08 sec HDFS Read: 3916 HDFS Write: 733 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 80 msec
OK
Time taken: 33.653 seconds
hive> select * from emp2;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.293 seconds, Fetched: 14 row(s)
- like关键字
仅复制表结构不复制数据
hive> create table emp3 like emp;
OK
Time taken: 0.102 seconds
hive> desc emp3;
OK
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int
Time taken: 0.095 seconds, Fetched: 8 row(s)
- location关键字 来指定数据仓库的目录
[hadoop@hadoop212 hadoop-2.6.0-cdh5.7.6]$ bin/hdfs dfs -mkdir /user/hive/warehouse/dept_location
hive> create table dept_location (
> deptno int,
> dname string,
> loc string
> )
> row format delimited fields terminated by '\t'
> location '/user/hive/warehouse/dept_location';
OK
Time taken: 0.085 seconds
hive> load data local inpath '/opt/datas/dept.txt' into table dept_location;
Loading data to table empmanager.dept_location
Table empmanager.dept_location stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.36 seconds
hive> select * from dept_location;
OK
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 0.077 seconds, Fetched: 4 row(s)
3.使用技巧和配置
(1)本地模式
在学习过程中,不是每次都要提交集群去运行任务的,我们需要使用本地模式
打开自动执行本地模式的设置
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive> set hive.exec.mode.local.auto=true;
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=true
这个设置其实你可以写死在hive-site.xml中,但是不推荐这么做
<property>
<name>hive.exec.mode.local.auto</name>
<value>true</value>
</property>
配置完本地模式后,节约了很多时间:
(2)提示信息的显示
修改hive-site.xml
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
hive (default)> show databases;
OK
database_name
default
empmanager
Time taken: 5.253 seconds, Fetched: 2 row(s)
hive (default)> show databases;
OK
database_name
default
empmanager
Time taken: 0.021 seconds, Fetched: 2 row(s)
hive (default)> select * from wc;
OK
wc.id wc.word
1 hadoop
2 spark
3 mysql
4 hadoop
5 hadoop
6 mysql
Time taken: 0.658 seconds, Fetched: 6 row(s)
三、案例分析
1.python和R语言谁更适合机器学习?
其实这个调查范围的局限性,所以调查结果只能参考,实际意义不大!
1,usca,1,1,1,0,0,0,1,1,1,0,0,1,0,7,Anaconda;scikit-learn;Python;R language;SQL language;Tensorflow;Other Deep Learning Tools
(1)创建数据库、数据表、读取数据
hive (default)> create database db_hive;
OK
Time taken: 0.099 seconds
hive (default)> use db_hive;
OK
Time taken: 0.027 seconds
hive (db_hive)> create table db_hive.tb_lc(
> id string,area string,python string,r string,sql string,rm string,excel string,spark st
ring,ms string,tensorflow string,scikit string,tableau string,knime string,deep string,spHadoop string,ntools int,toolstr string) > row format delimited fields terminated by ','
> lines terminated by '\n';
OK
Time taken: 0.163 seconds
hive (db_hive)> load data local inpath '/opt/datas/sw17.csv' overwrite into table db_hive.tb_lc;
Loading data to table db_hive.tb_lc
Table db_hive.tb_lc stats: [numFiles=1, numRows=0, totalSize=318533, rawDataSize=0]
OK
Time taken: 0.699 seconds
hive (db_hive)> select * from db_hive.tb_lc limit 5;
OK
tb_lc.id tb_lc.area tb_lc.python tb_lc.r tb_lc.sql tb_lc.rm tb_lc.excel t
b_lc.spark tb_lc.ms tb_lc.tensorflow tb_lc.scikit tb_lc.tableau tb_lc.knime tb_lc.deep tb_lc.sphadoop tb_lc.ntools tb_lc.toolstr1 usca 1 1 1 0 0 0 1 1 1 0 0 A
naconda;scikit-learn;Python;R language;SQL language;Tensorflow;Other Deep Learning Tools2 euro 0 0 0 0 0 0 0 0 0 0 0 O
range3 euro 1 0 1 1 0 0 0 0 0 0 0 P
yCharm;RapidMiner;Weka;Java;Python;SQL language;Open Source Hadoop Tools;SQL on Hadoop tools4 asia 1 1 0 1 0 0 0 0 0 0 0 C
4.5/C5.0/See5;PyCharm;RapidMiner;XLMiner;Python;R language5 afme 1 1 0 1 1 1 1 1 1 1 0 2
2 Anaconda;Dataiku;Excel;H2O.ai;IBM SPSS Statistics;Microsoft SQL Server;Orange;RapidMiner;scikit-learn;Tableau;Weka;Python;R language;Spark;Caffe ;Keras;Microsoft CNTK;mxnet;Tensorflow;Theano;Torch;PyTorchTime taken: 0.088 seconds, Fetched: 5 row(s)
hive (db_hive)>
(2)对于大数据spark/hadoop来说,python用的人多还是R用的人多?
同时使用spark/hadoop 和 python的人数:
hive (db_hive)> select * from db_hive.tb_lc limit 5;
OK
tb_lc.id tb_lc.area tb_lc.python tb_lc.r tb_lc.sql tb_lc.rm tb_lc.excel t
b_lc.spark tb_lc.ms tb_lc.tensorflow tb_lc.scikit tb_lc.tableau tb_lc.knime tb_lc.deep tb_lc.sphadoop tb_lc.ntools tb_lc.toolstr1 usca 1 1 1 0 0 0 1 1 1 0 0 A
naconda;scikit-learn;Python;R language;SQL language;Tensorflow;Other Deep Learning Tools2 euro 0 0 0 0 0 0 0 0 0 0 0 O
range3 euro 1 0 1 1 0 0 0 0 0 0 0 P
yCharm;RapidMiner;Weka;Java;Python;SQL language;Open Source Hadoop Tools;SQL on Hadoop tools4 asia 1 1 0 1 0 0 0 0 0 0 0 C
4.5/C5.0/See5;PyCharm;RapidMiner;XLMiner;Python;R language5 afme 1 1 0 1 1 1 1 1 1 1 0 2
2 Anaconda;Dataiku;Excel;H2O.ai;IBM SPSS Statistics;Microsoft SQL Server;Orange;RapidMiner;scikit-learn;Tableau;Weka;Python;R language;Spark;Caffe ;Keras;Microsoft CNTK;mxnet;Tensorflow;Theano;Torch;PyTorchTime taken: 0.088 seconds, Fetched: 5 row(s)
hive (db_hive)> select count(*) cnt_shp from db_hive.tb_lc where sphadoop=1 and python=1;
Query ID = hadoop_20200725142222_fb73d3c3-47eb-4e90-a090-2373e424c815
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1595637424134_0004, Tracking URL = http://hadoop212:8088/proxy/application_159563742
4134_0004/Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job -kill job_1595637424134_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-25 14:22:35,715 Stage-1 map = 0%, reduce = 0%
2020-07-25 14:22:46,846 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.57 sec
2020-07-25 14:22:54,515 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.48 sec
MapReduce Total cumulative CPU time: 7 seconds 480 msec
Ended Job = job_1595637424134_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.48 sec HDFS Read: 327181 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 480 msec
OK
cnt_shp
683
Time taken: 29.315 seconds, Fetched: 1 row(s)
同时使用spark/hadoop 和R的人数:
hive (db_hive)> select count(*) cnt_shr from db_hive.tb_lc where sphadoop=1 and r=1;
Automatically selecting local only mode for query
Query ID = hadoop_20200725142525_ed3b1bc1-0f3f-4611-876a-245314812521
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 14:25:12,677 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1060290705_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 645368 HDFS Write: 55479660 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
cnt_shr
606
Time taken: 1.605 seconds, Fetched: 1 row(s)
R在科学研究领域比较多,Python在开发应用领域比较多。
如果我们要将结果保存起来,保存为文件。底层还是mr程序,将输出的结果保存至hdfs上。
查询SQL:
select t1.cnt_shp as shp , t2.cnt_shr as shr from
(select '1' as id,count(*) cnt_shp from db_hive.tb_lc where sphadoop=1 and python=1) as t1
join
(select '1' as id,count(*) cnt_shr from db_hive.tb_lc where sphadoop=1 and r=1) as t2
on t1.id=t2.id ;
实现:
hive (db_hive)> create table if not exists db_hive.tb_result1 as
> select t1.cnt_shp as shp , t2.cnt_shr as shr from
> (select '1' as id,count(*) cnt_shp from db_hive.tb_lc where sphadoop=1 and python=1) as
t1 > join
> (select '1' as id,count(*) cnt_shr from db_hive.tb_lc where sphadoop=1 and r=1) as t2
> on t1.id=t2.id ;
Automatically selecting local only mode for query
Query ID = hadoop_20200725144848_353a820d-8d7a-4d1d-b00d-ee939387913d
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 14:48:07,178 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1605529441_0007
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 14:48:08,470 Stage-4 map = 100%, reduce = 100%
Ended Job = job_local710312008_0008
Stage-8 is selected by condition resolver.
Stage-9 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
Execution log at: /tmp/hadoop/hadoop_20200725144848_353a820d-8d7a-4d1d-b00d-ee939387913d.log
2020-07-25 02:48:11 Starting to launch local task to process map join; maximum memory = 477626
3682020-07-25 02:48:12 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/hadoop/
7a6d78e1-f740-4f09-baa2-ea4b542e1033/hive_2020-07-25_14-48-05_589_5138958313151404537-1/-local-10005/HashTable-Stage-5/MapJoin-mapfile41--.hashtable2020-07-25 02:48:12 Uploaded 1 File to: file:/tmp/hadoop/7a6d78e1-f740-4f09-baa2-ea4b542e1033/hive_
2020-07-25_14-48-05_589_5138958313151404537-1/-local-10005/HashTable-Stage-5/MapJoin-mapfile41--.hashtable (283 bytes)2020-07-25 02:48:12 End of local task; Time Taken: 0.983 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2020-07-25 14:48:14,681 Stage-5 map = 100%, reduce = 0%
Ended Job = job_local23033021_0009
Moving data to: hdfs://hadoop212:8020/user/hive/warehouse/db_hive.db/tb_result1
Table db_hive.tb_result1 stats: [numFiles=1, numRows=1, totalSize=8, rawDataSize=7]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 3186210 HDFS Write: 1076 SUCCESS
Stage-Stage-4: HDFS Read: 3823276 HDFS Write: 1308 SUCCESS
Stage-Stage-5: HDFS Read: 1911850 HDFS Write: 794 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
shp shr
Time taken: 9.404 seconds
hive (db_hive)> select * from db_hive.tb_result1;
OK
tb_result1.shp tb_result1.shr
683 606
Time taken: 0.07 seconds, Fetched: 1 row(s)
hdfs上文件中存储的数据:
字段信息存储在MySQL元数据中:
HQL是支持join的,但不推荐这么做。分析的数据来源是存储在hdfs上的文件,那么要让文件中的数据进行表连接查询,需要消耗大量的计算资源和磁盘IO,是不到万不得已不这么做。
自己先使用as关键字 将两张表拼接为一张大表 在对这一张表进行分析。
HQL不支持子查询
2.Hive语句的其他执行方式
2.链家网二手房交易记录分析
数据来源2005年好像。。。
梅园六街坊,2室0厅,47.72,浦东,低区/6层,朝南,500,104777,1992年建
小区名称,户型 ,面积,区域,层数,朝向,总价(万),单价(元/平方),房龄
创建数据库、数据表、加载数据:
hive (default)> show databases;
OK
database_name
db_hive
default
empmanager
Time taken: 4.989 seconds, Fetched: 3 row(s)
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> create database db_lianjia;
OK
Time taken: 0.223 seconds
hive (default)> use db_lianjia;
OK
Time taken: 0.06 seconds
hive (db_lianjia)> create table tb_lj(
> village_name string,
> house_type string,
> house_area string,
> region string,
> floor_str string,
> direction string,
> total_price string,
> square_price string,
> build_date string
> )
> row format delimited fields terminated by ','
> lines terminated by '\n'
> stored as textfile;
OK
Time taken: 0.259 seconds
hive (db_lianjia)> load data local inpath '/opt/datas/2nd_house_price.csv' overwrite into table tb_lj;
Loading data to table db_lianjia.tb_lj
Table db_lianjia.tb_lj stats: [numFiles=1, numRows=0, totalSize=2225947, rawDataSize=0]
OK
Time taken: 0.646 seconds
hive (db_lianjia)> select * from tb_lj limit 3;
OK
tb_lj.village_name tb_lj.house_type tb_lj.house_area tb_lj.region tb_lj.floor_stt
b_lj.direction tb_lj.total_price tb_lj.square_price tb_lj.build_date梅园六街坊 2室0厅 47.72 浦东 低区/6层 朝南 500 104777 1992年建
碧云新天地(一期) 3室2厅 108.93 浦东 低区/6层 朝南 735 67474 2002年建
博山小区 1室1厅 43.79 浦东 中区/6层 朝南 260 59374 1988年建
Time taken: 0.31 seconds, Fetched: 3 row(s)
hive (db_lianjia)>
(1)哪个面积区域的房子最受欢迎?
面积区域的划分:
50以下 | 50-70 | 70-90 | 90-110 | 110-130 | 130-150 | 150以上
hive (db_lianjia)> select t.area_group,count(*) as total
> from
> (select
> case
> when 0<house_area and house_area<=50 then '50平以下'
> when 50<house_area and house_area<=70 then '50-70平'
> when 70<house_area and house_area<=90 then '70-90平'
> when 90<house_area and house_area<=110 then '90-110平'
> when 110<house_area and house_area<=130 then '110-130平'
> when 130<house_area and house_area<=150 then '130-150平'
> else '150以上'
> end as area_group
> from db_lianjia.tb_lj ) as t
> group by t.area_group order by total desc;
Automatically selecting local only mode for query
Query ID = hadoop_20200725161717_93eaff12-2d84-4c3b-8e8f-bc311e73871b
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:18:02,165 Stage-1 map = 100%, reduce = 0%
2020-07-25 16:18:03,184 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local243861579_0001
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:18:04,685 Stage-2 map = 100%, reduce = 100%
Ended Job = job_local61328738_0002
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 4468278 HDFS Write: 4452202 SUCCESS
Stage-Stage-2: HDFS Read: 4469086 HDFS Write: 4452647 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
t.area_group total
50-70平 5775
70-90平 5739
90-110平 4505
50平以下 4291
110-130平 2798
150以上 2775
130-150平 2318
Time taken: 5.321 seconds, Fetched: 7 row(s)
结论:面积越小越受欢迎!总价低!
(2)哪个房龄区间的房子最受欢迎
5年以内 | 10年以内 | 15年以内 | 20年以内 | 20年以上
hive (db_lianjia)> select t.year_group,count(*) as total
> from
> (select
> case
> when (2020-substring(build_date,0,4)) between 0 and 5 then '5年以内'
> when (2020-substring(build_date,0,4)) between 6 and 10 then '10年以内'
> when (2020-substring(build_date,0,4)) between 11 and 15 then '15年以内'
> when (2020-substring(build_date,0,4)) between 16 and 20 then '20年以内'
> else '20年以上'
> end as year_group
> from db_lianjia.tb_lj) as t
> group by t.year_group order by total desc;
Automatically selecting local only mode for query
Query ID = hadoop_20200725164444_6a3b905e-baa6-4c2d-b165-b45f73d7b42e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:44:18,655 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local602938329_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-25 16:44:20,050 Stage-2 map = 100%, reduce = 100%
Ended Job = job_local421398031_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 8929446 HDFS Write: 4453037 SUCCESS
Stage-Stage-2: HDFS Read: 8930144 HDFS Write: 4453418 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
t.year_group total
20年以上 17654
15年以内 4422
20年以内 3125
10年以内 2954
5年以内 46
Time taken: 2.964 seconds, Fetched: 5 row(s)
结论:房龄越老越“香”;
(3)扩展练习,课后完成
- 根据其他的房屋特点指定需求和SQL
- 将结果保存为数据文件
- 上面所有的步骤以SQL文件形式运行,该文件可以反复运行并不会抛出异常!
扩展:mr排序输出
hive> select word,count(word) cnt from wc group by word order by cnt desc;
Query ID = hadoop_20200725103838_c9ea80b4-e34c-42b7-9b5c-da3c9bf24d5e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1595637424134_0002, Tracking URL = http://hadoop212:8088/proxy/application_159563742
4134_0002/Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job -kill job_1595637424134_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-25 10:39:03,173 Stage-1 map = 0%, reduce = 0%
2020-07-25 10:39:13,043 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.21 sec
2020-07-25 10:39:20,579 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.39 sec
MapReduce Total cumulative CPU time: 6 seconds 390 msec
Ended Job = job_1595637424134_0002
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1595637424134_0003, Tracking URL = http://hadoop212:8088/proxy/application_159563742
4134_0003/Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job -kill job_1595637424134_0003
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-07-25 10:39:29,542 Stage-2 map = 0%, reduce = 0%
2020-07-25 10:39:35,943 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.34 sec
2020-07-25 10:39:43,463 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 4.11 sec
MapReduce Total cumulative CPU time: 4 seconds 110 msec
Ended Job = job_1595637424134_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.39 sec HDFS Read: 6289 HDFS Write: 169 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.11 sec HDFS Read: 4698 HDFS Write: 25 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 500 msec
OK
hadoop 3
mysql 2
spark 1
Time taken: 51.236 seconds, Fetched: 3 row(s)
hive>