在hive中创建表之后需要将数据加载或者导入到表中,然后在hive中才能够用查询语句进行分析,本文就来整理一下hive加载数据的方法。

加载数据到表中的语法参考Loading files into tables,我们摘抄如下:

LOAD DATA [LOCAL] INPATH ``'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] 要加载的原数据有可能存放在本地目录,也有可能存放在hdfs的目录上,如果是存在在本地的话在加载数据的时候需要增加LOCAL关键字;如果是存在hdfs上则不要加该关键字。

OVERWRITE关键字指明对表中的原始数据是否覆盖原来的数据,对于有些应用场景,在加载数据的时候需要覆盖原来的数据则需要增加关键字OVERWRITE,而有些场景是要在表中追加新的数据,则不需要该关键字。

分区表在加载的时候需要增加关键字PARTITION ,同时需要指定分区的名称,否则的话将会无法查询到该分区对应的数据。

1.加载本地文件到hive表

load data local inpath ‘/opt/datas/emp.txt’ into table default.emp ;

hive (default)> truncate table emp ;
OK
Time taken: 1.228 seconds
hive (default)> select * from emp;
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
Time taken: 0.198 seconds
hive (default)> load data local inpath '/opt/datas/emp.txt' into table default.emp ;
Loading data to table default.emp
Table default.emp stats: [numFiles=1, numRows=0, totalSize=659, rawDataSize=0]
OK
Time taken: 1.234 seconds
hive (default)> select * from emp;
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.076 seconds, Fetched: 14 row(s)

2.加载hdfs文件到hive表中

load data inpath ‘/user/hive/user1/datas/emp.txt’ into table default.emp ;

hive (default)> dfs -mkdir -p /user/hive/user1/datas/ ;
hive (default)> dfs -put /opt/datas/emp.txt /user/hive/user1/datas/;
hive (default)> select * from default.emp;
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.136 seconds, Fetched: 14 row(s)
hive (default)> load data inpath '/user/hive/user1/datas/emp.txt' into table default.emp ;
Loading data to table default.emp
Table default.emp stats: [numFiles=2, numRows=0, totalSize=1318, rawDataSize=0]
OK
Time taken: 0.622 seconds
hive (default)> select * from default.emp;
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.111 seconds, Fetched: 28 row(s)

可以看到在加载后数据由14条变为28条,说明我们后面加载的数据追加到表emp中了。

3.加载数据覆盖表中已有的数据

load data inpath ‘/user/hive/user1/datas/emp.txt’ overwrite into table default.emp ;

hive (default)> dfs -put /opt/datas/emp.txt /user/hive/user1/datas/;
hive (default)> load data inpath '/user/hive/user1/datas/emp.txt' overwrite into table default.emp  ;
Loading data to table default.emp
Table default.emp stats: [numFiles=1, numRows=0, totalSize=659, rawDataSize=0]
OK
Time taken: 0.544 seconds
hive (default)> select * from emp;
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.12 seconds, Fetched: 14 row(s)

注意,从hdfs加载数据到hive表中的时候会删除原来hdfs路径下的数据文件,所以我们还需要重新导入数据到hdfs路径下。可以看到使用overwrite关键字后数据又变为14条了,新加载的数据覆盖了原来表中的数据。

4.创建表是通过insert加载

通过insert加载数据的示例如下:

create table default.emp_in like emp ;
insert into table default.emp_in select * from default.emp ;

执行结果如下:

hive (default)> create table default.emp_in like emp ;
OK
Time taken: 0.375 seconds
hive (default)> show tables;
OK
tab_name
dept
dept_cats
dept_part
emp
emp_ext
emp_ext2
emp_in
emp_part
Time taken: 0.07 seconds, Fetched: 8 row(s)
hive (default)> select * from emp_in;
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
Time taken: 0.254 seconds
hive (default)> insert into table default.emp_in select * from default.emp ;
Query ID = hive_20190214232121_9fe055ea-200a-4a04-9446-dd6c4cd34f5e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1550060164760_0003, Tracking URL = http://node1:8088/proxy/application_1550060164760_0003/
Kill Command = /opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/bin/hadoop job  -kill job_1550060164760_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-14 23:25:16,558 Stage-1 map = 0%,  reduce = 0%
2019-02-14 23:26:20,382 Stage-1 map = 0%,  reduce = 0%
2019-02-14 23:27:20,669 Stage-1 map = 0%,  reduce = 0%
2019-02-14 23:27:37,674 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.92 sec
MapReduce Total cumulative CPU time: 1 seconds 920 msec
Ended Job = job_1550060164760_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://node1:8020/user/hive/warehouse/emp_in/.hive-staging_hive_2019-02-14_23-21-37_609_1246238984503122049-1/-ext-10000
Loading data to table default.emp_in
Table default.emp_in stats: [numFiles=1, numRows=14, totalSize=661, rawDataSize=647]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.92 sec   HDFS Read: 4914 HDFS Write: 732 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 920 msec
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
Time taken: 362.312 seconds
hive (default)> select * from emp_in;
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
Time taken: 0.142 seconds, Fetched: 14 row(s)

可以看出通过这种方式加载数据的时候会调用MapReduce,在MapReduce运行完毕后数据会加载到新建的表中。

5.创建表的时候通过select加载数据

我们在创建子表的时候可以通过使用select查询原表数据的方式来加载数据,这种方式其实是上一种数据加载方式的合并,其效果是一样的。

create table if not exists default.dept_cats
as
select * from dept ;

6.创建表的时候通过location指定加载

这种方式适用于外部表,在创建外部表的时候指定对应的路径。参考下面新建外部表的语句:

create EXTERNAL table IF NOT EXISTS default.emp_ext2(
             empno int,
             ename string,
             job string,
             mgr int,
             hiredate string,
             sal double,
             comm double,
             deptno int
             )
             ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
             location '/user/hive/user1/warehouse/emp_ext2';

hive中常用的数据加载方式就是这6种,针对不同的应用场景我们可以选择不同的加载方式。