Spark Thrift: Deploying the Spark Thrift Server

Reposted

jordana 2023-06-05 16:31:37


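Spark SQL consists of four modules: Core, Catalyst, Hive, and Hive-thriftserver.
ThriftServer is a JDBC/ODBC interface: users connect to it over JDBC/ODBC to access Spark SQL data.
When ThriftServer starts, it launches a single Spark SQL application.
Clients that connect to the server over JDBC/ODBC share that application's resources, so different clients and users can share data.
On startup ThriftServer also starts a listener that waits for client connections and submitted queries.
Setup steps (for the case where Spark and Hive are not on the same node):
1. The Spark Thrift Server needs the Hive Metastore, so the metastore URIs must be added to Spark's configuration. To do this, first copy Hive's hive-site.xml into $SPARK_HOME/conf, then add the following properties: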

<!-- URIs of the remote Hive metastore. Set this in Hive's own hive-site.xml as well, so that the port does not have to be given when the metastore service is started. -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.252.145:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<!-- IP address of the node on which the Spark Thrift Server runs -->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>192.168.252.147</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
<!-- Port of the Spark Thrift Server. The default is 10000, which HiveServer2 also listens on by default, so change it to avoid a conflict. -->
<property>
    <name>hive.server2.thrift.port</name>
    <value>10001</value>
    <description>Port number of HiveServer2 Thrift interface when hive.server2.transport.mode is 'binary'.</description>
</property>
<!-- Minimum number of Spark Thrift Server worker threads (default 5) -->
<property>
    <name>hive.server2.thrift.min.worker.threads</name>
    <value>5</value>
    <description>Minimum number of Thrift worker threads</description>
</property>
<!-- Maximum number of Spark Thrift Server worker threads (default 500) -->
<property>
    <name>hive.server2.thrift.max.worker.threads</name>
    <value>500</value>
    <description>Maximum number of Thrift worker threads</description>
</property>
<!-- When false, YARN jobs see every HiveServer2 user as the hive user (the user running the server) rather than the connecting user -->
<property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
</property>
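
A client has to use the same host and port that these properties give the server. The following is a minimal sketch, not part of Spark or Hive, of how a client could read them from a hive-site.xml string and derive the JDBC URL. The class name HiveSiteReader, its methods, and the fallback values of localhost and 10000 are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not a Spark or Hive API.
final class HiveSiteReader {

    // Parses <property><name/><value/></property> entries into a name -> value map.
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                String value = childText(property, "value");
                if (name != null && value != null) {
                    props.put(name, value);
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse hive-site XML", e);
        }
    }

    // Returns the trimmed text of the first child element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    // Builds a hive2 JDBC URL; the fallbacks are assumptions of this sketch.
    static String jdbcUrl(Map<String, String> props, String database) {
        String host = props.getOrDefault("hive.server2.thrift.bind.host", "localhost");
        String port = props.getOrDefault("hive.server2.thrift.port", "10000");
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>hive.server2.thrift.bind.host</name>"
                + "<value>192.168.252.147</value></property>"
                + "<property><name>hive.server2.thrift.port</name>"
                + "<value>10001</value></property>"
                + "</configuration>";
        // prints jdbc:hive2://192.168.252.147:10001/default
        System.out.println(jdbcUrl(parse(xml), "default"));
    }
}
```

Deriving the URL from the configuration file keeps the client in step with the server when the port is changed to avoid the HiveServer2 conflict described above.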

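2. Start the Hive metastore service with hive --service metastore & (it must be running before you connect, and it must stay running).
3. Put the MySQL JDBC driver jar, mysql-connector-java, into Spark's jars directory.
4. Start the Thrift Server so that it runs on the cluster:
sbin/start-thriftserver.sh (options such as --master and --hiveconf can be passed at startup)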
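
Steps 2 and 4 both depend on a service listening on a known port: the metastore on 9083 and the Thrift Server on 10001 in the configuration above. As a rough sketch, a plain TCP connect can confirm that something is listening before a JDBC client is pointed at it. The class PortCheck and the 2-second timeout are assumptions of this example, and a successful connect only shows that the port is open, not that the service behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration only; not a Spark or Hive API.
final class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, unreachable host, or timeout
            return false;
        }
    }

    public static void main(String[] args) {
        // Addresses and ports taken from the hive-site.xml properties in step 1
        System.out.println("metastore listening: "
                + isListening("192.168.252.145", 9083, 2000));
        System.out.println("thrift server listening: "
                + isListening("192.168.252.147", 10001, 2000));
    }
}
```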

5. Write a JDBC client to access the Hive database. The Maven dependencies and the client class are as follows:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.2</version>
</dependency>
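
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JDBC_SparkSQL {
    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        // Host and port of the Spark Thrift Server configured in step 1
        String url = "jdbc:hive2://192.168.252.147:10001/default";
        // Load the Hive JDBC driver; fail fast if it is missing from the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String sql = "select name,cost from business order by cost";
        // try-with-resources closes the result set, statement, and connection even on error
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             Statement stmt = conn.createStatement();
             ResultSet res = stmt.executeQuery(sql)) {
            while (res.next()) {
                // Separate the two columns with a tab so they do not run together
                System.out.println(res.getString(1) + "\t" + res.getString(2));
            }
        }
    }
}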

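Project data stored on HDFS has to be associated with Hive. This is done as follows:
Specify the data path with LOCATION when creating the table. Hive then reads the files under that path as the table's data, so there is no need to run load or to move the files. Dropping an external table does not delete the data; only the metadata that describes the table is removed.
Example: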

create external table if not exists student(
 id int, name string
 )
 row format delimited fields terminated by '\t'
 location '/user/hive/warehouse/student';

The table can be created either from a client or through the Java API.
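
For the Java route, one possible approach is to send the same DDL through a JDBC Statement. The sketch below builds the statement text and shows a method that would execute it on an open connection. The class ExternalTableDdl and its methods are hypothetical names, the table is assumed to be tab-delimited text as in the example above, and the inputs are assumed to come from trusted code because they are concatenated into the SQL.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not a Spark or Hive API.
final class ExternalTableDdl {

    // Builds a CREATE EXTERNAL TABLE statement for a tab-delimited text table.
    // Table name, column definitions, and location must come from trusted input.
    static String build(String table, Map<String, String> columns, String location) {
        StringJoiner cols = new StringJoiner(", ");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            cols.add(column.getKey() + " " + column.getValue());
        }
        return "create external table if not exists " + table + " (" + cols + ") "
                + "row format delimited fields terminated by '\\t' "
                + "location '" + location + "'";
    }

    // Executes the DDL on an already open connection to the Thrift Server.
    static void create(Connection conn, String ddl) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in declaration order
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("id", "int");
        columns.put("name", "string");
        System.out.println(build("student", columns, "/user/hive/warehouse/student"));
    }
}
```

The create method would be called with a connection obtained as in the JDBC client from step 5. It is kept separate from build so that the statement text can be inspected without a running server.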
