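Contents

  • Why connect to Hive from Spark running locally?
  • Implementation
  • 1. Scala implementation
  • 1. Port setup
  • 2. Metastore setup
  • 3. Hostname setup
  • 4. Environment variable setup
  • 2. pyspark implementation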


 

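Why connect to Hive from Spark running locally?

Often, after writing a Spark application, we want to test it by reading a Hive table, without going through the tedious cycle of building a jar, uploading it to the cluster, and running spark-submit. In that case we can run the Spark application locally in IDEA and read Hive tables directly from the remote Hadoop cluster.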

 

Implementation

Spark version: 2.0+

1. Scala implementation

The code is as follows:

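import org.apache.spark.sql.SparkSession

object SparkDemo {
	def main(args: Array[String]): Unit = {
		// Run locally with 4 threads; table metadata comes from the remote Hive Metastore
		val spark = SparkSession
			.builder()
			.appName("Spark to Hive")
			.master("local[4]")
			.config("hive.metastore.uris", "thrift://bigdata01:9083,thrift://bigdata02:9083")
			.enableHiveSupport()
			.getOrCreate()

		// "db.table" is a placeholder: replace it with an existing database.table
		val df = spark.read.table("db.table")
		df.show(false)

		spark.stop()
	}
}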

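1. Port setup

First, make sure your local machine can reach port 9083 (the Hive Metastore service port). Some companies do not have their own data center and run their clusters on public cloud servers (Alibaba Cloud, Tencent Cloud), in which case the local machine may be unable to reach the port. You will then need the operations team to open the port for you.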
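Before involving the operations team, it can help to confirm whether the port is actually blocked. The following is a minimal sketch using only the Python standard library; the function name `can_connect` and the 3-second default timeout are illustrative choices, not something from the original setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and opens a TCP connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

For example, `can_connect("bigdata01", 9083)` returning False suggests that either the port is closed to your machine or the host name does not resolve yet (see the hostname setup step).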

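2. Metastore setup

When creating the SparkSession instance, the value of "hive.metastore.uris" must match the value configured for hive.metastore.uris in hive-site.xml under the Hive installation directory on the cluster. This setting identifies the nodes where the Hive Metastore Server runs.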
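If you have a copy of the cluster's hive-site.xml, the value can be read programmatically instead of copied by hand. This is a rough sketch with the standard library XML parser; `read_hive_property` is a hypothetical helper and `SAMPLE` is an illustrative excerpt that reuses the URIs from this article, not a real cluster configuration.

```python
import xml.etree.ElementTree as ET


def read_hive_property(xml_text, name):
    """Return the <value> of the <property> whose <name> matches, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


# Hypothetical excerpt of hive-site.xml, using the URIs from this article
SAMPLE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://bigdata01:9083,thrift://bigdata02:9083</value>
  </property>
</configuration>"""

uris = read_hive_property(SAMPLE, "hive.metastore.uris")
# Several metastore URIs are given as one comma-separated string
print(uris.split(","))
```

The returned string can be passed unchanged as the value of "hive.metastore.uris" when building the SparkSession.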

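3. Hostname setup

Cluster nodes are usually addressed by hostname rather than directly by IP address, partly for security reasons. On Windows, to use these hostnames instead of IP addresses, add the IP-to-hostname mappings to C:\Windows\System32\drivers\etc\hosts so that the hostnames used in the code resolve to the correct IP addresses. The entries take the following form: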

10.1.11.10           bigdata01
10.1.11.11           bigdata02
10.1.11.12           bigdata03
...
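To verify that the hosts entries have taken effect, you can ask the operating system to resolve each hostname. The sketch below assumes the hostnames from this article; `unresolved_hosts` is a hypothetical helper, and its `resolve` parameter exists only so the resolver can be swapped out.

```python
import socket


def unresolved_hosts(hosts, resolve=socket.gethostbyname):
    """Return the hosts that the resolver cannot map to an IP address."""
    missing = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing
```

Calling `unresolved_hosts(["bigdata01", "bigdata02", "bigdata03"])` should return an empty list once the hosts file is configured correctly.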

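4. Environment variable setup

Typically, even after completing the configuration above, running the Spark application still fails with an error like the following: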

20/11/23 20:55:35 INFO metastore: Trying to connect to metastore with URI thrift://bigdata01:9083
20/11/23 20:55:35 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
	at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
	at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:272)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:253)
	at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
	at scala.HiveDataCleaing.YunYing.SupplierCompanyData$.main(SupplierCompanyData.scala:15)
	at scala.HiveDataCleaing.YunYing.SupplierCompanyData.main(SupplierCompanyData.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
	at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
	... 44 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
	... 50 more
Caused by: java.lang.NullPointerException
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
	at org.apache.hadoop.util.Shell.run(Shell.java:455)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
	at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:84)
	at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52)
	at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:51)
	at org.apache.hadoop.security.Groups.getGroups(Groups.java:176)
	at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1488)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:436)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
	... 55 more

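This happens because accessing files on a Hadoop cluster from Windows requires some native Windows APIs that emulate POSIX-like file permissions, and those native APIs are implemented in a file named winutils.exe. When that file cannot be found, Hadoop's Shell class fails to launch it, which surfaces as the NullPointerException at the bottom of the trace.

  1. Download the winutils.exe that matches your Hadoop version from https://github.com/cdarlint/winutils/tree/master/
  2. Place it in a directory of your choice on Windows, for example: D:\Hadoop\bin\winutils.exe.
  3. Add a HADOOP_HOME environment variable on Windows that points to the parent of the bin directory (D:\Hadoop in this example).
  4. Run the Spark application again; it can now read Hive table data from the remote Hadoop cluster.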
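As an alternative to the system-wide setting in step 3, the variable can be set for the current process only, which is convenient when running pyspark. This is a sketch under the assumption that the Spark JVM started by pyspark inherits the Python process environment; `winutils_path` and `configure_hadoop_home` are hypothetical helper names, and the PATH update is an extra precaution rather than a step from this article.

```python
import os


def winutils_path(hadoop_home):
    """Return the path to winutils.exe under hadoop_home/bin, or None if absent."""
    candidate = os.path.join(hadoop_home, "bin", "winutils.exe")
    return candidate if os.path.isfile(candidate) else None


def configure_hadoop_home(hadoop_home):
    """Set HADOOP_HOME for this process if winutils.exe is present; return success."""
    if winutils_path(hadoop_home) is None:
        return False
    os.environ["HADOOP_HOME"] = hadoop_home
    # Also put the bin directory on PATH for this process (extra precaution)
    bin_dir = os.path.join(hadoop_home, "bin")
    os.environ["PATH"] = bin_dir + os.pathsep + os.environ.get("PATH", "")
    return True
```

With the layout from step 2, you would call `configure_hadoop_home(r"D:\Hadoop")` before creating the SparkSession, since a change made after the JVM has started does not reach it.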

 

2. pyspark implementation

The code is as follows:

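import os
from pyspark.sql import SparkSession

# Raw string so the backslashes in the Windows path are not treated as escapes;
# point this at the directory where Spark was extracted
os.environ["SPARK_HOME"] = r"F:\App\spark-2.3.1-bin-hadoop2.6"

if __name__ == '__main__':
    # Run locally with 4 threads; table metadata comes from the remote Hive Metastore
    spark = SparkSession \
        .builder \
        .appName("Spark to Hive") \
        .master("local[4]") \
        .config("hive.metastore.uris", "thrift://bigdata01:9083,thrift://bigdata02:9083") \
        .enableHiveSupport() \
        .getOrCreate()

    # "db.table" is a placeholder: replace it with an existing database.table
    df = spark.read.table("db.table")
    df.show(truncate=False)

    spark.stop()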

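The pyspark version differs from the Scala one in that pyspark depends on some files from the Spark distribution at runtime, so the SPARK_HOME environment variable must be set. Download the matching Spark release from the Spark website (a file named like spark-2.3.3-bin-hadoop2.6.tgz), extract it to a directory of your choice, and set SPARK_HOME to the root of the extracted directory.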

References