Hive Bucket Sampling and Grouped Aggregation

Reposted

mob64ca13f9a97c 2023-08-18 23:35:52


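Contents

1. Usage

Method 1: Hive interactive shell
Method 2: Hive JDBC service
Method 3: Hive commands

2. Basic operations

Managing databases and tables

Managing databases:
Managing tables:

Regular tables
External tables
Partitioned tables
Bucketed tables
Altering tables
Loading data into Hive tables
Exporting data from Hive tables (query export; works for managed and external tables; the target can be local disk or HDFS)

Hive query syntax

a. SELECT
b. Common functions
c. LIMIT clause
d. WHERE clause
e. Comparison operators (BETWEEN / IN / IS NULL)
f. LIKE and RLIKE
g. Grouping
h. JOIN statements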

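1. Usage

Method 1: Hive interactive shell

Configure the Hive environment variables:

vi /etc/profile.d/hive.sh
Add the following lines, then save and exit:
export HIVE_HOME=/export/servers/hive-1.1.0-cdh5.14.0
export PATH=$PATH:$HIVE_HOME/bin

Apply the configuration:
source /etc/profile

Enter Hive:
Type hive and press Enter to open the Hive shell.
List all databases:
show databases;
Create a database:
create database myhive;
Switch to that database and create a table:

use myhive;
create table test1(id int,name string);

Note: after running the commands above, confirm that a database named hive has appeared in MySQL.
What the hive database in MySQL is for:

It stores the metadata that maps Hive databases and tables to their locations in the HDFS file system.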

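Method 2: Hive JDBC service

Using Hive on node01 as an example:

Start the hiveserver2 service

In the foreground:
hive --service hiveserver2

In the background:
nohup hive --service hiveserver2  &

Connect to hiveserver2 with beeline

Start beeline:
beeline
Connect to hiveserver2 on node01:
!connect jdbc:hive2://node01:10000

Note: before connecting to hiveserver2 with beeline, make sure Hive's metastore database has been created in MySQL; otherwise the connection is refused.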

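Method 3: Hive commands

Use the -e option to run an HQL statement directly

hive -e "use myhive;select * from test1;"

Use the -f option to run the HQL statements stored in a text file

Create and edit a script file:
vim hive.sql

Add the following content:
use myhive;select * from test1;

Run the script:
hive -f hive.sql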

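2. Basic operations

Managing databases and tables

Managing databases:

Create a database:
create database if not exists myhive;
Switch to that database:
use  myhive;

Note: the default storage location for Hive tables is set by a property in hive-site.xml:
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>

Create a database and specify its storage location on HDFS:
create database myhive2 location '/myhive2';

Set a creation-date property on the database:
alter  database  myhive2  set  dbproperties('createtime'='20880611');
Note: alter database can change some properties of a database. The database name cannot be changed, and in the Hive 1.1.0 release used here neither can the database location.

View database details:
a: basic information
desc  database  myhive2;
b: more detailed information
desc database extended  myhive2;

Drop a database:
a: drop an empty database; if the database still contains tables, this fails with an error
drop  database  myhive2;
b: force-drop a database together with the tables it contains
drop  database  myhive  cascade;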

Managing tables:

Regular tables

a. Create a table:
use myhive;
create table stu(id int,name string);
insert into stu values (1,"zhangsan");
select * from stu;

b. Create a table and specify the field delimiter and the storage location of the table data
create  table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' location '/stu2';

c. Create a table from a query result
create table stu3 as select * from stu2;

d. Create a table from the structure of an existing table
create table stu4 like stu2;

e. View formatted, detailed table information
desc formatted stu2;

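External tables

About external tables:
An external table points at data in some other HDFS path, so Hive does not consider itself the sole owner of that data. When the Hive table is dropped, the data stays on HDFS and is not deleted.

When to use managed and external tables:
Website logs collected every day are written periodically to text files on HDFS. Heavy statistical analysis is done on top of an external table (the raw log table); the intermediate and result tables are stored as managed (internal) tables, and data enters them through SELECT + INSERT.

a. Create an external table:
create external table techer (t_id string,t_name string) row format delimited fields terminated by '\t';

b. Load data into the table from the local file system
load data local inpath '/tmp/techer.csv' into table techer;
Note: this uploads (copies) the local file to HDFS.

c. Load local data and overwrite the existing data
load data local inpath '/tmp/techer.csv' overwrite into table techer;

d. Load data into the table from HDFS
Upload the file to HDFS:
hdfs dfs -put /tmp/techer.csv /
Then load it from the Hive shell:
load data inpath '/techer.csv' into table techer;
Note: this moves the file into the directory that stores the table's data.

Additional note: if the techer table is dropped, its data still exists on HDFS, and after the table is created again the data is immediately visible in it. This is because techer is an external table: after drop table, the table's data remains on HDFS.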
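The difference in drop behaviour can be modelled in a few lines of Python. This is only a toy sketch: the `ToyWarehouse` class, its two dictionaries (one standing in for the metastore, one for HDFS directories), the paths and the sample rows are all hypothetical and are not part of Hive.

```python
class ToyWarehouse:
    """Toy model: a metastore dict plus a dict standing in for HDFS directories."""

    def __init__(self):
        self.metastore = {}  # table name -> {"location": ..., "external": ...}
        self.hdfs = {}       # directory path -> list of rows

    def create_table(self, name, location, external=False):
        self.metastore[name] = {"location": location, "external": external}
        # Creating a table never discards data already present at the location.
        self.hdfs.setdefault(location, [])

    def load(self, name, rows):
        self.hdfs[self.metastore[name]["location"]].extend(rows)

    def select_all(self, name):
        return list(self.hdfs[self.metastore[name]["location"]])

    def drop_table(self, name):
        meta = self.metastore.pop(name)      # metadata is always removed
        if not meta["external"]:
            self.hdfs.pop(meta["location"], None)  # data removed only for managed tables


wh = ToyWarehouse()

# External table: the data survives the drop and is visible after re-creation.
wh.create_table("techer", "/warehouse/techer", external=True)
wh.load("techer", [("01", "Zhang"), ("02", "Li")])
wh.drop_table("techer")
wh.create_table("techer", "/warehouse/techer", external=True)
rows_after_recreate = wh.select_all("techer")

# Managed table: dropping it removes the data directory as well.
wh.create_table("stu", "/warehouse/stu")
wh.load("stu", [(1, "zhangsan")])
wh.drop_table("stu")
managed_data_left = "/warehouse/stu" in wh.hdfs
```

In this model `rows_after_recreate` still holds both rows, while `managed_data_left` is `False`.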

Partitioned tables

Overview:
Divide and conquer is one of the most common ideas in big data: a large file is split into many small files, and each operation then touches only one small file, which is much easier. Hive supports the same idea. Large data sets can be split by day or by hour into smaller pieces, which are far easier to work with.

a. Create a partitioned table (with one partition column)

create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

b. Create a table with several partition columns

create table score2 (s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

c. Load data into a partitioned table

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

d. Load data into a table with several partition columns

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');

e. Query several partitions together with union all

select * from score where month = '201806' union all select * from score where month = '201805';

f. List partitions

show  partitions  score;

g. Add a partition

alter table score add partition(month='201805');

h. Add several partitions at once

alter table score add partition(month='201804') partition(month = '201803');

Note: after a partition is added, a new folder appears under the table's directory on HDFS.

i. Drop a partition

alter table score drop partition(month = '201806');

j. Repair partitions automatically

msck  repair   table  score;

k. Repair partitions manually (that is, add the partition)

alter table score add partition(month='201805');
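The folder that appears for each partition follows a `column=value` naming pattern. The sketch below builds such a path in Python; the `partition_dir` helper is hypothetical, and it assumes a non-default database stored under the default warehouse directory, where Hive uses a `<database>.db` folder. A database created with its own `location` would live elsewhere.

```python
import posixpath


def partition_dir(warehouse, database, table, partition_spec):
    """Build the HDFS directory for one partition.

    partition_spec is an ordered list of (column, value) pairs,
    one pair per partition column, in the order they were declared.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(warehouse, f"{database}.db", table, *parts)


single = partition_dir("/user/hive/warehouse", "myhive", "score",
                       [("month", "201805")])
nested = partition_dir("/user/hive/warehouse", "myhive", "score2",
                       [("year", "2018"), ("month", "06"), ("day", "01")])
print(single)  # /user/hive/warehouse/myhive.db/score/month=201805
print(nested)  # /user/hive/warehouse/myhive.db/score2/year=2018/month=06/day=01
```

A table with several partition columns gets one nested directory level per column, which is why a query that filters on the partition columns only has to read the matching folders.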

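Bucketed tables

Overview: bucketing splits the data into several buckets according to a chosen column. Put simply, rows are divided by the value of that column and written to several separate files.

a. Enable bucketing enforcement in Hive

set hive.enforce.bucketing=true;

b. Set the number of reducers

set mapreduce.job.reduces=3;

c. Create a bucketed table (the number of buckets equals the number of reducers)

create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

Note: in this Hive version, data must be loaded into a bucketed table with insert overwrite from a query. Uploading a file with hdfs dfs -put or using load data only places the file in the table directory and does not distribute the rows into buckets.

e. Load data into the bucketed table with insert overwrite

Data source: a query on a regular table
Create the regular table:
create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';
Load data into the regular table:
load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;
Load data into the bucketed table with insert overwrite:
insert overwrite table course select * from course_common cluster by(c_id);

f. View the table structure

desc formatted course;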
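As a rough illustration of how rows end up in buckets, the Python sketch below hashes the bucketing column and takes the result modulo the number of buckets. The `toy_hash` function is a Java-style 31-based string hash chosen only for illustration; it is an assumption, not a statement of the hash Hive actually uses, and the sample rows are made up.

```python
def toy_hash(value: str) -> int:
    """Java-style 31-based string hash kept to 32 bits (illustrative only)."""
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_for(value: str, num_buckets: int) -> int:
    """Return the 0-based bucket index for one bucketing-column value."""
    # Mask off the sign bit so the result is never negative.
    return (toy_hash(value) & 0x7FFFFFFF) % num_buckets


def split_into_buckets(rows, key_index, num_buckets):
    """Distribute rows into num_buckets lists by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_index], num_buckets)].append(row)
    return buckets


# Hypothetical (c_id, c_name, t_id) rows, bucketed on c_id into 3 buckets.
rows = [("01", "Chinese", "02"), ("02", "Math", "01"), ("03", "English", "03")]
buckets = split_into_buckets(rows, key_index=0, num_buckets=3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

The point of the sketch is that the same column value always lands in the same bucket, which is what makes bucket sampling and bucket-wise joins possible.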

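g. Sampling a bucketed table

Syntax and requirements
Format: select * from bucketed_table tablesample(bucket x out of y on bucketing_column);
tablesample is a keyword.
x and y are both numbers, and on is followed by the bucketing column. If the table is bucketed by id, write the id column here.

Suppose the table has 4 buckets in total (call this number z).
How sampling works: start at bucket x and take one bucket every y buckets until z/y buckets have been taken. The n-th bucket taken is bucket x+y*(n-1).

bucket 1 out of 2 on id:
This takes the 1st and 3rd buckets (bucket 0 and bucket 2 when counting from 0).

Sampling starts at the 1st bucket (bucket 0) and takes 4/2 = 2 buckets in total: bucket 0 and bucket 2.

bucket 2 out of 1 on id:
This is not allowed, because x must be greater than 0 and no greater than y, and here 2 is greater than 1.

bucket 1 out of 1 on id:
x equals y. Sampling starts at the 1st bucket (bucket 0) and takes 4/1 = 4 buckets,
which is the same as querying the whole table.

The 4 buckets taken are: bucket 0, bucket 1, bucket 2 and bucket 3.

bucket 2 out of 4 on id:

Sampling starts at the 2nd bucket (bucket 1) and takes 4/4 = 1 bucket: bucket 1.

bucket 2 out of 8 on id:

Sampling starts at the 2nd bucket (bucket 1) and takes 4/8 = 0.5 bucket: half of bucket 1.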
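The sampling arithmetic above can be checked with a small Python function. This is a sketch under the assumptions used in the examples: the table has `total_buckets` buckets and `y` is either a factor or a multiple of that number. The function name `sampled_buckets` is hypothetical; bucket indexes in the result count from 0.

```python
def sampled_buckets(x: int, y: int, total_buckets: int):
    """Return (bucket_index, fraction) pairs picked by BUCKET x OUT OF y."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if y <= total_buckets:
        if total_buckets % y != 0:
            raise ValueError("this sketch only covers y dividing the bucket count")
        # Whole buckets: start at bucket x (index x-1) and step by y.
        return [(index, 1.0) for index in range(x - 1, total_buckets, y)]
    if y % total_buckets != 0:
        raise ValueError("this sketch only covers y being a multiple of the bucket count")
    # y is larger than the bucket count: only part of a single bucket is read.
    return [((x - 1) % total_buckets, total_buckets / y)]


print(sampled_buckets(1, 2, 4))  # [(0, 1.0), (2, 1.0)]
print(sampled_buckets(1, 1, 4))  # [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)]
print(sampled_buckets(2, 4, 4))  # [(1, 1.0)]
print(sampled_buckets(2, 8, 4))  # [(1, 0.5)]
```

Calling `sampled_buckets(2, 1, 4)` raises `ValueError`, matching the rule that x cannot exceed y.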

Altering tables

a. Rename a table

Syntax:
alter  table  old_table_name  rename  to  new_table_name;

Rename table score4 to score5:
alter table score4 rename to score5;

b. Add or change columns

Show the table structure:
desc score5;

Add columns:
alter table score5 add columns (mycol string, mysco string);

Show the table structure again:
desc score5;

Change a column (both its name and its data type):
alter table score5 change column mysco mysconew int;

c. Drop a table

drop table score5;

Loading data into Hive tables

a. Insert data directly into a partitioned table

create table score3 like score;
insert into table score3 partition(month ='201807') values ('001','002','100');

b. Insert data through a query

Copy the table structure:
create table score4 like score;

Insert the data:
insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;
Note: the overwrite keyword replaces any data already in the target partition; use insert into instead to append.

c. Multi-insert mode
Overview: often used in production to split one table into two or more parts.

Load data into the score table:
load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

Create the table for the first part:
create table score_first( s_id string,c_id  string) partitioned by (month string) row format delimited fields terminated by '\t' ;

Create the table for the second part:
create table score_second(c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Load data into both tables in one statement:
from score 
insert overwrite table score_first partition(month='201806') select s_id,c_id 
insert overwrite table score_second partition(month = '201806')  select c_id,s_score;

e. Save a query result into a new table

create table score5 as select * from score;

f. Point a table at its data with location when creating it

Create the table and specify its location on HDFS:
create external table score6 (s_id string,c_id string,s_score int) row format delimited fields terminated by '\t' location '/myscore6';

Upload the data to HDFS:
hdfs dfs -mkdir -p /myscore6
hdfs dfs -put score.csv /myscore6

Query the data:
select * from score6;

g. Exporting and importing Hive table data with export and import (for managed tables; data can only be exported to and imported from HDFS)

Create techer2 by copying the structure of techer:
create table techer2 like techer;

Export the data of the techer table to the /export/techer directory on HDFS (this copies the files holding the data into that directory):
export table techer to  '/export/techer';

Import the data in the /export/techer directory on HDFS into the techer2 table:
import table techer2 from '/export/techer';

Note: export and import are statements run from the Hive shell.
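The idea behind the multi-insert statement is that the source table is read once and every row feeds several targets. A minimal Python sketch of that single pass follows; the `multi_insert` function and the sample rows are hypothetical and only mirror the column choices of score_first and score_second.

```python
def multi_insert(score_rows):
    """Read the source rows once and build both target tables."""
    score_first, score_second = [], []
    for s_id, c_id, s_score in score_rows:      # one pass over the source
        score_first.append((s_id, c_id))        # select s_id, c_id
        score_second.append((c_id, s_score))    # select c_id, s_score
    return score_first, score_second


# Hypothetical (s_id, c_id, s_score) rows.
score = [("01", "01", 80), ("01", "02", 90), ("02", "01", 70)]
first, second = multi_insert(score)
print(first)   # [('01', '01'), ('01', '02'), ('02', '01')]
print(second)  # [('01', 80), ('02', 90), ('01', 70)]
```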

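Exporting data from Hive tables (query export; works for managed and external tables; the target can be local disk or HDFS)

Overview: export the data in a Hive table to any other location, for example the local Linux disk, HDFS, or MySQL.

a. Export with insert

Export a query result to the local file system:
insert overwrite local directory '/export/score' select * from score;

Export a query result to the local file system with formatting:
insert overwrite local directory '/export/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;
Note: collection items terminated by '#' separates the elements of collection types with #.

Export a query result to HDFS (without the local keyword):
insert overwrite directory '/export/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;
Note: # is used as the separator for collection types. This table has no collection columns, so the result is the same with or without the clause.

b. Export to the local file system with a Hadoop command (run from inside the Hive shell)

dfs -get /export/servers/000000_0 /export/servers/exporthive/local.txt;

c. Export with a hive shell command

Basic syntax: (hive -f/-e statement or script > file)
hive -e "select * from myhive.score;" > /export/servers/exporthive/score.txt

d. Export with sqoop

e. Clear the data in a table (managed tables only)

truncate table score5;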

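Hive query syntax

a. SELECT

Syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list [HAVING condition]] 
[CLUSTER BY col_list 
  | [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list] 
] 
[LIMIT number]

Notes:

• order by sorts the input globally, so only one reducer is used; with a large input the computation takes a long time.
  ASC (ascend): ascending order (default)
  DESC (descend): descending order
• sort by is not a global sort; it sorts the data before it enters each reducer. So if you sort with sort by and set mapred.reduce.tasks>1, sort by only guarantees that the output of each reducer is sorted, not that the overall result is sorted.
• distribute by (column) sends rows to different reducers according to the given column, using a hash of that column.
• cluster by (column) does what distribute by does and also sorts by that column.

So when the distribute by column and the sort by column are the same, cluster by = distribute by + sort by.
However, the sort is always ascending; ASC or DESC cannot be specified with cluster by.
What bucketed tables are for: their biggest benefit is making join operations more efficient.

Query the whole table:

select * from score;

Query selected columns:

select s_id ,c_id from score;

Give a column an alias:

Purpose and usage:
1. Rename a column.
2. Make calculations easier to work with.
3. The alias follows the column name directly; the keyword AS may also be placed between the column name and the alias.
For example:
select s_id as myid ,c_id from score;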
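The difference between these clauses is easier to see on a handful of rows. The Python sketch below simulates two reducers; it is a toy model in which the reducer for a row is simply `key % num_reducers` (an assumption standing in for Hive's hash), and the sample `(s_id, s_score)` rows are made up.

```python
def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen from its key (toy hash: key % n)."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers


def sort_by(reducers, key):
    """Sort inside each reducer only; no ordering across reducers."""
    return [sorted(part, key=key) for part in reducers]


def order_by(rows, key):
    """Global sort: everything goes through a single reducer."""
    return sorted(rows, key=key)


def cluster_by(rows, key, num_reducers):
    """distribute by + sort by on the same column, always ascending."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


rows = [(1, 80), (2, 95), (3, 60), (4, 70), (5, 90), (6, 85)]
s_id = lambda row: row[0]
s_score = lambda row: row[1]

per_reducer = sort_by(distribute_by(rows, s_id, 2), s_score)
scores_in_output_order = [score for part in per_reducer for _, score in part]
globally_sorted = [score for _, score in order_by(rows, s_score)]
clustered = cluster_by(rows, s_id, 2)

print(per_reducer)             # each inner list is sorted by score
print(scores_in_output_order)  # [70, 85, 95, 60, 80, 90], not globally sorted
print(globally_sorted)         # [60, 70, 80, 85, 90, 95]
print(clustered)               # same s_id always in the same reducer, ascending
```

Each reducer's slice is in order, but concatenating the slices is not, which is exactly the gap between sort by and order by.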

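b. Common functions

1) Count the total number of rows (count):
select count(*) from score;

2) Find the highest score (max):
select max(s_score) from score;

3) Find the lowest score (min):
select min(s_score) from score;

4) Sum the scores (sum):
select sum(s_score) from score;

5) Average the scores (avg):
select avg(s_score) from score;

c. LIMIT clause

A typical query returns many rows. The LIMIT clause restricts the number of rows returned:

View the first three rows:
select * from score limit 3;
View rows 3 through 5 (the offset form limit offset,count skips 2 rows and returns 3; it requires Hive 2.0.0 or later):
select * from score limit 2,3;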
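In the two-argument form, the first number is an offset (rows to skip) and the second is the number of rows to return, not an end position. The slice below is a minimal Python analogy; the `limit` helper is hypothetical, and the list of row numbers stands in for query results.

```python
def limit(rows, count, offset=0):
    """Mimic LIMIT count and LIMIT offset, count with a list slice."""
    return rows[offset:offset + count]


rows = list(range(1, 11))  # pretend these are row numbers 1..10

first_three = limit(rows, 3)            # limit 3
skip3_take5 = limit(rows, 5, offset=3)  # limit 3,5
rows_3_to_5 = limit(rows, 3, offset=2)  # limit 2,3

print(first_three)   # [1, 2, 3]
print(skip3_take5)   # [4, 5, 6, 7, 8]
print(rows_3_to_5)   # [3, 4, 5]
```

Without an order by, the order of rows returned by a query is not guaranteed, so "rows 3 through 5" only has a stable meaning on sorted output.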

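d. WHERE clause

1) Use the WHERE clause to filter out rows that do not satisfy the condition.
2) The WHERE clause comes right after the FROM clause.
3) Worked example

Find the rows with a score above 60:
select * from score where s_score > 60 ;

e. Comparison operators (BETWEEN / IN / IS NULL)

BETWEEN:
Supported data types: primitive types
Find the rows with a score from 60 to 80 (both bounds included):
select * from score where s_score between 60 and 80;

IN:
Find the rows with a score of 60, 70 or 80:
select * from score where s_score IN (60,70,80);

IS NULL: tests whether a value is NULL

f. LIKE and RLIKE

Use the LIKE operator to select values that match a pattern.
The pattern can contain characters or digits:
% stands for zero or more characters (any number of characters).
_ stands for exactly one character.
The RLIKE clause is a Hive extension of this feature; it lets you specify the match condition with Java regular expressions, a more powerful language.

Find all scores that start with 8:
select * from score where s_score like '8%';

Find all scores whose second digit is 9:
select * from score where s_score like '_9%';

Find all scores that contain a 9:
select * from score where s_score rlike '[9]';
or
select * from score where s_score like '%9%';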
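One way to see how the two wildcards behave is to translate a LIKE pattern into a regular expression. The Python sketch below does that with the standard `re` module. It is only an approximation: Hive uses Java regular expressions for RLIKE, Python's dialect differs in some details, and escape characters inside LIKE patterns are ignored here. The helper names are hypothetical.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a LIKE pattern: % -> any run of characters, _ -> one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(value, pattern: str) -> bool:
    """LIKE must match the whole value."""
    return re.fullmatch(like_to_regex(pattern), str(value), flags=re.DOTALL) is not None


def rlike(value, regex: str) -> bool:
    """RLIKE succeeds if the regex matches anywhere in the value."""
    return re.search(regex, str(value)) is not None


scores = [85, 58, 79, 192, 90, 80]
starts_with_8 = [s for s in scores if like(s, "8%")]
second_digit_9 = [s for s in scores if like(s, "_9%")]
contains_9_like = [s for s in scores if like(s, "%9%")]
contains_9_rlike = [s for s in scores if rlike(s, "[9]")]

print(starts_with_8)     # [85, 80]
print(second_digit_9)    # [79, 192]
print(contains_9_like)   # [79, 192, 90]
print(contains_9_rlike)  # [79, 192, 90]
```

The key difference is that LIKE has to cover the whole value, while RLIKE only needs the expression to match somewhere inside it.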

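g. Grouping

GROUP BY:
The GROUP BY clause is usually used together with aggregate functions: it groups the result by one or more columns and then runs the aggregation on each group.

Compute the average score of each student:
select s_id ,avg(s_score) from score group by s_id;

Compute the highest score of each student:
select s_id ,max(s_score) from score group by s_id;

Notes:
1. Every column in the select list that is not inside an aggregate function must also appear in group by; the select list cannot contain plain columns beyond the group by columns.
2. In a group by query, each column that appears in the select list must meet one of two requirements:
It is a grouping column
It is used inside an aggregate function

HAVING:
How having differs from where:

where acts on the columns of the table and filters rows before grouping; having acts on the columns of the query result and filters the groups.
Aggregate functions cannot be used after where, but they can be used after having.
having is only used in group by statements.

Compute the average score of each student:
select s_id ,avg(s_score) from score group by s_id;

Find the students whose average score is above 85:
select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85;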
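The order of operations (where filters rows, group by forms groups, having filters groups) can be sketched in plain Python. The `group_avg` helper and the `(s_id, s_score)` sample rows are hypothetical.

```python
from collections import defaultdict


def group_avg(rows, where=None, having=None):
    """Average score per s_id with optional row filter and group filter."""
    groups = defaultdict(list)
    for s_id, s_score in rows:
        # WHERE: applied to individual rows, before grouping.
        if where is None or where(s_id, s_score):
            groups[s_id].append(s_score)
    # GROUP BY + avg(): one value per group.
    result = {s_id: sum(scores) / len(scores) for s_id, scores in groups.items()}
    # HAVING: applied to the aggregated value of each group.
    if having is not None:
        result = {s_id: avg for s_id, avg in result.items() if having(avg)}
    return result


rows = [("01", 80), ("01", 90), ("02", 70), ("02", 60), ("03", 95), ("03", 85)]

all_avgs = group_avg(rows)
above_85 = group_avg(rows, having=lambda avg: avg > 85)
filtered_first = group_avg(rows, where=lambda s_id, score: score > 80,
                           having=lambda avg: avg > 85)

print(all_avgs)        # {'01': 85.0, '02': 65.0, '03': 90.0}
print(above_85)        # {'03': 90.0}
print(filtered_first)  # {'01': 90.0, '03': 90.0}
```

In the last call the where filter removes the 80 for student 01 before averaging, so that student now passes the having test; filtering before and after grouping gives different answers.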

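h. JOIN statements

Equality joins
Hive supports the usual SQL JOIN statements, but the Hive version used here only supports equality joins; non-equality join conditions are not supported.

Look up the student name for each score:
SELECT s.s_id,s.s_score,stu.s_name,stu.s_birth  FROM score s LEFT JOIN student stu ON s.s_id = stu.s_id;

Table aliases

Benefits:
(1) Aliases make queries shorter.
(2) Prefixing columns with the table name or alias removes ambiguity and, according to the original material, can improve execution efficiency.

Combine the teacher and course tables:
select * from techer t join course c on t.t_id = c.t_id;

Inner join (INNER JOIN or JOIN)
Inner join: only rows for which both tables contain data matching the join condition are kept.

select * from techer t inner join course c on t.t_id = c.t_id;

Left outer join (LEFT JOIN)
Left outer join: all records from the table on the left of the JOIN operator that satisfy the WHERE clause are returned; where the right table has no match, its columns are NULL.
Find the courses for each teacher:

select * from techer t left join course c on t.t_id = c.t_id;

Right outer join (RIGHT JOIN)
Right outer join: all records from the table on the right of the JOIN operator that satisfy the WHERE clause are returned; where the left table has no match, its columns are NULL.

select * from techer t right join course c on t.t_id = c.t_id;

Full outer join (FULL JOIN)
Full outer join: all records from both tables that satisfy the WHERE clause are returned. Where either table has no matching value for the join column, NULL is used instead.

In other words, it is the combination of the left join and right join results. Note that the rows matched by the inner join are not duplicated.

SELECT * FROM techer t FULL JOIN course c ON t.t_id = c.t_id ;

Multi-table joins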
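The four join types differ only in what happens to rows without a partner. The Python sketch below joins two small lists on t_id; the `join` function, its output shape `(t_name, c_name)` and the sample rows are hypothetical, and `None` plays the role of NULL.

```python
def join(teachers, courses, how="inner"):
    """Join (t_id, t_name) rows with (c_name, t_id) rows on t_id.

    how is one of "inner", "left", "right", "full".
    """
    rows = []
    matched_courses = set()
    for t_id, t_name in teachers:
        hits = [i for i, (_, course_t_id) in enumerate(courses) if course_t_id == t_id]
        for i in hits:
            rows.append((t_name, courses[i][0]))
        matched_courses.update(hits)
        # Left side without a partner: kept only by left and full joins.
        if not hits and how in ("left", "full"):
            rows.append((t_name, None))
    # Right side without a partner: kept only by right and full joins.
    if how in ("right", "full"):
        for i, (c_name, _) in enumerate(courses):
            if i not in matched_courses:
                rows.append((None, c_name))
    return rows


teachers = [("01", "Zhang"), ("02", "Li"), ("03", "Wang")]
courses = [("Chinese", "02"), ("Math", "01"), ("English", "04")]

inner = join(teachers, courses, "inner")
left = join(teachers, courses, "left")
right = join(teachers, courses, "right")
full = join(teachers, courses, "full")

print(inner)  # [('Zhang', 'Math'), ('Li', 'Chinese')]
print(left)   # [('Zhang', 'Math'), ('Li', 'Chinese'), ('Wang', None)]
print(right)  # [('Zhang', 'Math'), ('Li', 'Chinese'), (None, 'English')]
print(full)   # [('Zhang', 'Math'), ('Li', 'Chinese'), ('Wang', None), (None, 'English')]
```

The full join contains the matched rows once, plus the unmatched rows from each side, which is why it is not simply the left and right results appended together.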

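This article is reposted content. We respect the original author's copyright in the article. If there are content errors or infringement issues, the original author is welcome to contact us to correct or remove the article.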
