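Chapter 2 Getting Started
1. What are Hive's biggest limitations?
First, it does not support row-level insert, update, or delete.
Second, query latency is high (queries run as Hadoop MapReduce jobs), so it is a poor fit for low-latency interactive work.
Third, it does not support transactions.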
2. What does the Hive MetaStore do?
Hive persists table schemas and other system metadata in the metastore.
The information required for table schemas, partition information, etc., is small, typically much smaller than the large quantity of data stored in Hive. As a result, you typically don’t need a powerful dedicated database server for the metastore. However, because it represents a Single Point of Failure (SPOF), it is strongly recommended that you replicate and back up this database using the standard techniques you would normally use with other relational database instances. We won’t discuss those techniques here.
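3. In a distributed Hadoop cluster, Hive submits MapReduce jobs to the Hadoop cluster.
First: does Hive need to be installed on every machine in the cluster? Answer: no. A single Hive instance is enough; Hive can be thought of as a client that submits MapReduce jobs.
Second: does Hive need to be installed on the machine that hosts the Hadoop master node? Answer: no. Hive can submit jobs to the Hadoop master remotely.
4. Word Count in Hive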
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
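To make the data flow of the query above easier to follow, here is a minimal Python sketch of the same steps: split each line, explode it into one word per row, group and count, then order by word. It only imitates the logic on an in-memory list. The function name `word_counts` and the sample lines are assumptions made for illustration, and dropping empty strings is a simplification rather than a claim about what Hive does.

```python
import re
from collections import Counter

def word_counts(lines):
    """Imitate: split(line, ...) -> explode -> GROUP BY word -> ORDER BY word."""
    words = []
    for line in lines:
        # Split on runs of whitespace, then "explode" into one word per row.
        words.extend(re.split(r"\s+", line.strip()))
    # Simplification: ignore the empty strings produced by blank lines.
    counts = Counter(w for w in words if w)
    # ORDER BY word
    return sorted(counts.items())

print(word_counts(["hello hive", "hello hadoop"]))
```

Each tuple in the result corresponds to one row of the word_counts table.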
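Chapter 3 Data Types and File Formats
1. While using Hive, a metastore_db directory appeared under Hive's bin directory. How was it created?
By default Hive uses Derby as the MetaStore database. Derby is a file-based database, so it creates a metastore_db directory in the current working directory of the hive process and uses it as the MetaStore database directory.
In other words, Derby creates metastore_db in whichever directory the hive command is run from.
2. The ./hive command is equivalent to ./hive --service cli. cli is the default service started by the hive command; in other words, hive is really a command for launching services.
3. Starting the hwi service reports an error. Reference:
4. Row, column, and collection-item delimiters in CREATE TABLE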
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
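ROW FORMAT DELIMITED introduces the FIELDS TERMINATED BY, COLLECTION ITEMS TERMINATED BY, and MAP KEYS TERMINATED BY clauses. In other words, whenever you specify row, column, or collection-item delimiters (or any one of them), the clause must start with ROW FORMAT DELIMITED.
FIELDS TERMINATED BY specifies the column delimiter.
COLLECTION ITEMS TERMINATED BY specifies the delimiter between STRUCT elements, array elements, and MAP entries (key-value pairs).
MAP KEYS TERMINATED BY specifies the delimiter between the key and the value inside a MAP entry.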
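The three delimiter levels can be sketched in Python as nested splits. This is only an illustration of how one text line maps onto the employees columns, not Hive's own parsing code; the function name `parse_employee` and the sample row are assumptions.

```python
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"  # the '\001', '\002', '\003' delimiters

def parse_employee(line):
    """Split one text line into the five columns of the employees table."""
    # LINES TERMINATED BY '\n', then FIELDS TERMINATED BY '\001'
    name, salary, subs, deds, addr = line.rstrip("\n").split(FIELD)
    # STRUCT members are separated by the collection-item delimiter
    street, city, state, zip_code = addr.split(ITEM)
    return {
        "name": name,
        "salary": float(salary),
        # ARRAY elements: COLLECTION ITEMS TERMINATED BY '\002'
        "subordinates": subs.split(ITEM) if subs else [],
        # MAP entries split on '\002', key and value split on '\003'
        "deductions": {k: float(v) for k, v in
                       (kv.split(KEY) for kv in deds.split(ITEM) if kv)},
        "address": {"street": street, "city": city,
                    "state": state, "zip": int(zip_code)},
    }

# Hypothetical sample row, built with the same delimiters.
row = FIELD.join([
    "John Doe", "100000.0",
    ITEM.join(["Mary Smith", "Todd Jones"]),
    ITEM.join(["Federal Taxes" + KEY + "0.2", "State Taxes" + KEY + "0.05"]),
    ITEM.join(["1 Michigan Ave.", "Chicago", "IL", "60600"]),
]) + "\n"
print(parse_employee(row))
```

Note that the same '\002' character separates array elements, map entries, and struct members; only the key/value split inside a map entry uses '\003'.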
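5. Integrity checking
Hive does not validate data against the schema when writing, for example during LOAD DATA. (Integrity checking at write time is one of the core capabilities of a relational database: schema on write.) Instead, Hive applies the schema when reading: it splits and parses every row according to the schema and tolerates malformed rows as far as it can.
For example, when a row has too few columns the missing ones are filled with null, and when it has too many the extra ones are discarded.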
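The read-time behavior described above (pad missing columns with null, discard extra ones) can be sketched as follows. This is a simplified illustration, not Hive's implementation; `read_row` and its parameters are hypothetical names.

```python
def read_row(line, num_columns, sep="\x01"):
    """Fit a raw text line to a schema of num_columns columns at read time."""
    fields = line.rstrip("\n").split(sep)
    fields = fields[:num_columns]                   # extra columns are dropped
    fields += [None] * (num_columns - len(fields))  # missing columns become null
    return fields

print(read_row("a\x01b", 3))              # too few columns
print(read_row("a\x01b\x01c\x01d", 3))    # too many columns
```

Nothing is rejected when the file is loaded; the mismatch only becomes visible in query results.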
Chapter 4 HiveQL: Data Definition
1. HQL is perhaps closest to MySQL’s dialect, but with significant differences. Hive offers no support for row-level inserts, updates, and deletes.
2. Hive doesn’t support transactions.
3. HiveQL DDL statements are used for creating, altering, and dropping databases, tables, views, functions, and indexes. (No mention of partitions or buckets?)
4. SHOW and DESCRIBE commands are used for listing and describing items.
5. Databases in Hive
The Hive concept of a database is essentially just a catalog or namespace of tables. However, they are very useful for larger clusters with multiple teams and users, as a way of avoiding table name collisions. It’s also common to use databases to organize production tables into logical groups. If you don’t specify a database, the default database is used.
4.1 Database operations
4.1.1 Creating a database
hive> CREATE DATABASE financials;
./hdfs dfs -ls /user/hive/warehouse
drwxr-xr-x - hadoop supergroup 0 2015-04-04 04:35 /user/hive/warehouse/financials.db
4.1.2 Showing the name of the current database in the Hive CLI prompt
hive> set hive.cli.print.current.db=true;
hive (financials)>
4.1.3 Basic database operations
hive> CREATE DATABASE financials COMMENT 'Holds all financial tables'; -- create a database and attach a COMMENT
hive> DESCRIBE DATABASE financials; -- describe the database
hive> USE financials; -- switch to the database
hive> DROP DATABASE IF EXISTS financials; -- not allowed if the database still contains tables
hive> DROP DATABASE IF EXISTS financials CASCADE; -- if the database contains tables, drop them along with it
hive> CREATE DATABASE financials WITH DBPROPERTIES ('creator' = 'Mark Moneybags', 'date' = '2012-01-02'); -- set key-value properties when creating the database
hive> DESCRIBE DATABASE EXTENDED financials; -- EXTENDED also shows the key-value properties set at creation
hive> ALTER DATABASE financials SET DBPROPERTIES ('creator' = 'Joe Dba'); -- the database name and data location cannot be changed, but the key-value properties can
4.2 Table operations
4.2.1 Creating a table
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
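A COMMENT can be attached to each column as well as to the table as a whole.
LOCATION sets the path where the table's data is stored. The default is the warehouse path + the database name + the table name, which is the path used above.
4.2.2 Describing a table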
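The default location rule can be written out as a small Python sketch. The function name `default_table_location` is hypothetical, and the `warehouse` default assumes the warehouse directory seen in the listings above (/user/hive/warehouse); a real deployment may configure a different one.

```python
import posixpath

def default_table_location(db, table, warehouse="/user/hive/warehouse"):
    """Build the default HDFS directory for a table (illustrative only)."""
    if db == "default":
        # Tables in the default database sit directly under the warehouse dir.
        return posixpath.join(warehouse, table)
    # Other databases get a "<name>.db" directory under the warehouse dir.
    return posixpath.join(warehouse, db + ".db", table)

print(default_table_location("mydb", "employees"))
```

For mydb.employees this yields the same path that the LOCATION clause above spells out explicitly.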
hive> DESCRIBE EXTENDED mydb.employees;
name string Employee name
salary float Employee salary
subordinates array<string> Names of subordinates
deductions map<string,float> Keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int> Home address
Detailed Table Information Table(tableName:employees, dbName:mydb, owner:me,
...
location:hdfs://master-server/user/hive/warehouse/mydb.db/employees,
parameters:{creator=me, created_at='2012-01-02 10:00:00',
last_modified_user=me, last_modified_time=1337544510,
comment:Description of the table, ...}, ...)