sqoop进入mysql sqoop eval mysql

转载

数据侠客行 2023-12-20 01:04:28

文章标签 sqoop进入mysql sqoop 数据库 mysql hive 文章分类 MySQL 数据库

sqoop使用入门

数据导入 import

mysql to hdfs
mysql to hive

mysql导入hive 过程
普通导入
导入到hive的指定库和表

mysql to hbase

增量数据导入
数据导出

hdfs to mysql
hive to mysql
hbase to mysql

sqoop就是做数据迁移的，主要用于数据的迁入和迁出，是hadoop常用的组件。

sqoop使用时，第一个要用到的命令是：sqoop help 查看帮助文档
第二个常用命令是：sqoop help COMMAND 某一命令的使用帮助

数据导入 import

sqoop help import

Common arguments:
   --connect <jdbc-uri>                         Specify JDBC connect
                                                string
   --connection-manager <class-name>            Specify connection manager
                                                class name
   --connection-param-file <properties-file>    Specify connection
                                                parameters file
   --driver <class-name>                        Manually specify JDBC
                                                driver class to use
   --hadoop-home <hdir>                         Override
                                                $HADOOP_MAPRED_HOME_ARG
   --hadoop-mapred-home <dir>                   Override
                                                $HADOOP_MAPRED_HOME_ARG
   --help                                       Print usage instructions
-P                                              Read password from console
   --password <password>                        Set authentication
                                                password
   --password-alias <password-alias>            Credential provider
                                                password alias
   --password-file <password-file>              Set authentication
                                                password file path
   --relaxed-isolation                          Use read-uncommitted
                                                isolation for imports
   --skip-dist-cache                            Skip copying jars to
                                                distributed cache
   --username <username>                        Set authentication
                                                username
   --verbose                                    Print more information
                                                while working

--connect		指定mysql连接
--password  	指定mysql的密码
--username		指定mysql的用户名
-m   			指定maptask的任务数量 
--target-dir	指定hdfs的输出目录
--fields-terminated-by  指定输入到hdfs的文件的字段之间的分隔符
--columns		指定需要导入的列
--where			指定过滤条件
--split-by		指定切分maptask的依据字段的
--table			指定mysql中需要导入到大数据平台的表名

mysql to hdfs

下面举例进行说明

sqoop import \
--connect jdbc:mysql://hadoop03:3306/mysql \
--username root \
--password 123456 \
--table user \
-m 1

说明：
sqoop import \  ：反斜杠 \ 代表换行
 --connect jdbc:mysql://hadoop03:3306/mysql \   ：连接方式是mysql；主机是 hadoop03；端口是3306；mysql是数据库名
 --table user \  ：表名是 user

导入hdfs的默认输出目录是：  /user/jacob/user  其中，jacob是用户名，第二个user是表名
默认的分隔符是 逗号 “,”

指定输出目录、分隔符：

sqoop import \
--connect jdbc:mysql://hadoop03:3306/mysql \
--username root \
--password 123456 \
--table help_keyword \
--target-dir /user/data01/mydata/help_keyword \
--fields-terminated-by '\t' \
-m 1

说明：
--target-dir  指定导入路径
--fields-terminated-by '\t'  指定导入分隔符为：\t  （即tab）

指定需要导入的字段（查询指定列）

指定需要导入的字段
// 查询指定列
sqoop import   \
--connect jdbc:mysql://hadoop03:3306/mysql   \
--username root  \
--password 123456   \
--columns "name" \
--table help_keyword  \
--target-dir /user/sqoop/data01/myoutport1  \
-m 1

说明：
--columns "name" \  指定导入的列

指定过滤条件

例如要实现：select name from help_keyword where help_keyword_id>100;
可以使用如下语句：

sqoop import   \
--connect jdbc:mysql://hadoop03:3306/mysql   \
--username root  \
--password 123456   \
--columns "name" \
--where "help_keyword_id>100" \
--table help_keyword  \
--target-dir /user/sqoop/data01/where_condition  \
-m 1

说明：
--where  后面跟一个where条件

指定sql查询语句结果导入

sqoop import   \
--connect jdbc:mysql://hadoop03:3306/mysql   \
--username root  \
--password 123456   \
--query 'select * from help_keyword where  help_keyword_id>200 and $CONDITIONS' \
--target-dir /user/sqoop/data01/sql_statement  \
-m 1

注意：
    1） --query 和  --where  --columns  --table 不可以一起使用
    2)报错
    Query [select * from help_keyword where help_keyword_id>200] must contain '$CONDITIONS' in WHERE clause
    $CONDITIONS 没有实际含义，仅仅是语法要求。
    3)sql语句一定要 使用单引号

指定启动多个maptask任务

sqoop import   \
--connect jdbc:mysql://hadoop03:3306/  \
--username root  \
--password 123456   \
--target-dir /user/sqoop/data01/some_mtask  \
--query 'select * from mysql.help_keyword where help_keyword_id > 100 and $CONDITIONS' \
--split-by  help_keyword_id \
--fields-terminated-by '\t'  \
-m 3

说明：
-m 是指多个maptask，必须配合 --split-by 使用。（整形数据、一般选取自增主键，防止数据倾斜）

help_keyword_id 是从0开始连续顺序递增的，总计352条数据，3个maptask，每个maptask 117 条数据。所以，数据结果应该是：
part-m-00000    101----217
part-m-00001    218---334
part-m-00002    335---452

多个maptask任务，如何进行数据分配呢？

首先：  获取最小id 和 最大id 
然后：  (最大id-最小id +1 ) /maptask  ，求的是每一个maptask分配的数据跨度
最后：  每一个maptask顺序获取数据

eg：有如下15条数据
    0
    1
    1
    1
    1
    2
    2
    2
    2
    3
    4
    5
    6
    11
    11
    11
数据范围是：0～11，所以每个maptask的数据跨度范围是：12/3=4
最终maptask的数据分配范围如下：
maptask0   0-3   10个数据
maptask1   4-7   3个数据
maptask2   8-11  3个数据

造成一定程度的数据倾斜。为了避免数据倾斜，一般 split-by的时候，选择的是自增主键

mysql to hive

默认启动4个maptask任务

--hive-import   指定导入到hive中
--hive-overwrite   覆盖导入
--hive-database  指定数据库
--hive-table 指定hive中的表

mysql导入hive 过程

1）先把这个数据导入到hdfs的默认路径下 （/user/jacob/help_keyword）
2）在hive中建表 
3）将hdfs的文件，加载到hive的表中 （移动文件到hive默认存储目录）

普通导入

sqoop import   \
--connect jdbc:mysql://hadoop03:3306/mysql \
--username root  \
--password 123456   \
--table help_keyword   \
--hive-import \
-m 1

报错：
Output directory hdfs://bd1904/user/jacob/help_keyword already exists

日志：
20/03/21 07:31:17 INFO hive.HiveImport: OK
20/03/21 07:31:17 INFO hive.HiveImport: Time taken: 20.432 seconds
20/03/21 07:31:18 INFO hive.HiveImport: Loading data to table default.help_keyword
20/03/21 07:31:20 INFO hive.HiveImport: OK
20/03/21 07:31:20 INFO hive.HiveImport: Time taken: 2.994 seconds
20/03/21 07:31:21 INFO hive.HiveImport: Hive import complete.
20/03/21 07:31:21 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.

所以需要删除已存在的目录：user/jacob/help_keyword ，之后再次导入即可。

所以mysql导入hive的过程是：
1）先把这个数据导入到hdfs的默认路径下 （/user/jacob/help_keyword）
2）在hive中建表 
3）将hdfs的文件，加载到hive的表中 （移动文件到hive默认存储目录）

默认导入到hive的目录是：  defalut . 表名
导入后存储路径是：/user/hive/warehouse/表名
导入数据的默认分割符是：  \001

导入到hive的指定库和表

sqoop import  \
--connect jdbc:mysql://hadoop03:3306/mysql  \
--username root  \
--password 123456  \
--table help_keyword  \
--fields-terminated-by "\t"  \
--lines-terminated-by "\n"  \
--hive-import  \
--hive-overwrite  \
--create-hive-table  \
--delete-target-dir \
--hive-database  test_sqoop \
--hive-table new_help_keyword

注意：在sqoop导入数据到hive的时候，可以默认帮建表，但是不会创建数据库。
数据库需要手动创建。

说明：
--create-hive-table  创建hive表

mysql to hbase

sqoop import \
--connect jdbc:mysql://hadoop02:3306/mysql \
--username root \
--password 123456 \
--table help_keyword \
--hbase-table help_keyword_hbase \
--column-family info \
--hbase-row-key help_keyword_id

说明：
--hbase-table  指定hbase的表名
--column-family 指定hbase对应的列族下
--hbase-row-key 指定hbase的rowkey 
注意：hbase中的表，  需要手动创建
create "help_keyword_hbase","info"

增量数据导入

数据导入分为：全量数据导入和增量数据导入。

全量数据导入：每次导入全部的数据
增量数据导入：每次导入新增的数据

增量数据导入需要注意3个参数：

--check-column <column>    Source column to check for incremental change  
增量导入的校验键，一般选主键

--incremental <import-type>    Define an incremental import of type 'append' or 'lastmodified'  指定增量导入的类型
append： 追加导入
lastmodified  修改导入
        
--last-value <value>    Last imported value in the incremental check column  
指定 校验键 上一次导入的最后一个值，这次是从这个值的下一个开始导入

sqoop import   \
--connect jdbc:mysql://hadoop02:3306/mysql   \
--username root  \
--password 123456   \
--table help_keyword  \
--target-dir /user/data01/data_append  \
--incremental  append  \
--check-column  help_keyword_id \
--last-value 200  \
-m 1

说明：此增量导入是加载mysql数据后，只选取 主键 从201之后的数据，加载到hdfs中。

数据导出

一般经过分析之后的数据，才有导出到mysql的需求。

hdfs to mysql

sqoop export \
--connect jdbc:mysql://hadoop02:3306/sqoopdb  \
--username root \
--password 123456 \
--table hdfs_to_mysql \
--export-dir /user/data01/mydata/help_keyword \
--fields-terminated-by '\t'

说明：
--export-dir  ：指定导出的hdfs的路径
--fields-terminated-by  ：指定hdfs文件的字段分割符
--table  ：指定的导出的mysql中的表

mysql中的 数据库 和表 都需要自己手动创建。
create database sqoopdb;
use sqoopdb;
CREATE TABLE hdfs_to_mysql ( 
   id INT, 
   name VARCHAR(60)
);

hive to mysql

sqoop export \
--connect jdbc:mysql://hadoop02:3306/sqoopdb \
--username root \
--password 123456 \
--table hive_to_mysql \
--export-dir /user/hive/warehouse/test_sqoop.db/new_help_keyword \
--input-fields-terminated-by '\t'

说明：
--export-dir   ：指定hive表所在的hdfs的路径
--input-fields-terminated-by  ：指定hive的表文件的分隔符

CREATE TABLE hive_to_mysql ( 
   id INT, 
   name VARCHAR(60)
);