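Contents
- Why connect to Hive from a local Spark application?
- Implementation
- 1. Scala implementation
- 1. Port settings
- 2. Metastore settings
- 3. Host name settings
- 4. Environment variable settings
- 2. PySpark implementation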
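Why connect to Hive from a local Spark application?
After writing a Spark application, we often want to test it against a Hive table without going through the tedious cycle of building a jar, uploading it to the cluster, and running spark-submit. In that case we can run the application locally in IDEA and read Hive tables on the remote Hadoop cluster directly.
Implementation
Spark version: 2.0+
1. Scala implementation
The code is as follows: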
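import org.apache.spark.sql.SparkSession

object SparkDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() requires the spark-hive module on the classpath.
    // hive.metastore.uris lists the Metastore Server addresses of the cluster.
    val spark = SparkSession
      .builder()
      .appName("Spark to Hive")
      .master("local[4]")
      .config("hive.metastore.uris", "thrift://bigdata01:9083,thrift://bigdata02:9083")
      .enableHiveSupport()
      .getOrCreate()

    // Replace db.table with an existing Hive table.
    val df = spark.read.table("db.table")
    df.show(false)

    spark.stop()
  }
}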
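1. Port settings
First, make sure the local machine can reach port 9083, the port of the Hive Metastore service. Some companies do not run their own data center and host the cluster on public cloud servers (Alibaba Cloud, Tencent Cloud), in which case the port may not be reachable from a local machine and the operations team has to open it.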
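Before starting Spark, it can help to confirm that the port is actually reachable. The following is a minimal sketch using only the Python standard library; the function name `can_connect` and the 3-second timeout are assumptions made for illustration, not part of the original setup.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `can_connect("bigdata01", 9083)` from the local machine returns False when the port is blocked or the host name cannot be resolved. It does not distinguish between those two causes, so it is only a first check.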
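2. Metastore settings
When creating the SparkSession instance, set "hive.metastore.uris" to the value configured for hive.metastore.uris in the hive-site.xml file under the cluster's Hive installation directory (typically in its conf subdirectory). This setting identifies the nodes on which the Hive Metastore Server runs.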
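If you have a copy of the cluster's hive-site.xml, the value can be read out programmatically instead of being copied by hand. This is a sketch under the assumption that the file uses the usual Hadoop layout of `<property>` entries with `<name>` and `<value>` children; `read_hive_property` and `SAMPLE_HIVE_SITE` are hypothetical names, and the sample content is illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_hive_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of a property from hive-site.xml content, or None."""
    root = ET.fromstring(xml_text)
    # Each setting is a <property> element with <name> and <value> children.
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


# Illustrative content only; a real hive-site.xml contains many more properties.
SAMPLE_HIVE_SITE = """
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://bigdata01:9083,thrift://bigdata02:9083</value>
  </property>
</configuration>
"""

uris = read_hive_property(SAMPLE_HIVE_SITE, "hive.metastore.uris")
print(uris)
```

The returned string can be passed unchanged as the value of "hive.metastore.uris" when building the SparkSession.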
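3. Host name settings
Generally, for security reasons, cluster nodes are given host names instead of being referenced directly by IP address. On Windows, to use these host names, add the mapping from IP address to host name to C:\Windows\System32\drivers\etc\hosts, so that the names used in the code resolve to the right IP addresses. The entries look like this: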
10.1.11.10 bigdata01
10.1.11.11 bigdata02
10.1.11.12 bigdata03
...
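To see which cluster host names are still missing from the hosts file, its content can be parsed with a few lines of Python. The sketch below works on a string so that it does not depend on the file location; `parse_hosts`, `missing_hosts`, and `SAMPLE_HOSTS` are hypothetical names introduced for this example.

```python
from typing import Dict, Iterable, List


def parse_hosts(text: str) -> Dict[str, str]:
    """Map each host name in hosts-file content to its IP address."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Assumption of this sketch: keep the first entry for a repeated name.
            mapping.setdefault(name.lower(), ip)
    return mapping


def missing_hosts(text: str, required: Iterable[str]) -> List[str]:
    """Return the required host names that have no entry in the hosts content."""
    known = parse_hosts(text)
    return [name for name in required if name.lower() not in known]


SAMPLE_HOSTS = """
# cluster nodes
10.1.11.10 bigdata01
10.1.11.11 bigdata02
"""

missing = missing_hosts(SAMPLE_HOSTS, ["bigdata01", "bigdata02", "bigdata03"])
print(missing)  # ['bigdata03']
```

In practice the text would come from reading the hosts file itself, and the required names would be the ones that appear in hive.metastore.uris and in the cluster's HDFS configuration.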
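4. Environment variable settings
Even with all of the above in place, running the Spark application usually still fails with an error like the following: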
20/11/23 20:55:35 INFO metastore: Trying to connect to metastore with URI thrift://bigdata01:9083
20/11/23 20:55:35 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:272)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:253)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at scala.HiveDataCleaing.YunYing.SupplierCompanyData$.main(SupplierCompanyData.scala:15)
at scala.HiveDataCleaing.YunYing.SupplierCompanyData.main(SupplierCompanyData.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
... 44 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
... 50 more
Caused by: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:808)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:791)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:84)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52)
at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.getGroups(JniBasedUnixGroupsMappingWithFallback.java:51)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:176)
at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1488)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:436)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
... 55 more
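This happens because accessing files on a Hadoop cluster from Windows requires some native Windows APIs that emulate POSIX-style file permissions, and these are implemented in a file named winutils.exe. When Hadoop cannot locate that file, it has no command to execute, which is why the trace ends with a NullPointerException in ProcessBuilder.start.
- Download the winutils.exe that matches your Hadoop version from https://github.com/cdarlint/winutils/tree/master/
- Place it in a directory on the Windows machine, for example D:\Hadoop\bin\winutils.exe.
- Add a HADOOP_HOME environment variable in Windows that points to the directory containing bin (D:\Hadoop in this example).
- Run the Spark application again; it can now read Hive table data from the remote Hadoop cluster.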
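The steps above can be verified with a small check that looks for winutils.exe under HADOOP_HOME. This is a sketch only; `find_winutils` is a hypothetical helper, and it assumes the layout described above, with winutils.exe in the bin subdirectory of HADOOP_HOME.

```python
import os
from typing import Mapping, Optional


def find_winutils(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the path to winutils.exe under HADOOP_HOME/bin, or None if absent."""
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        # The variable is unset or empty.
        return None
    candidate = os.path.join(hadoop_home, "bin", "winutils.exe")
    return candidate if os.path.isfile(candidate) else None


path = find_winutils()
print(path or "winutils.exe not found; check HADOOP_HOME")
```

A running process does not pick up environment variables that were added after it started, so IDEA most likely needs to be restarted before the new HADOOP_HOME value is visible to the application.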
2. PySpark implementation
The code is as follows:
import os

from pyspark.sql import SparkSession

# Raw string so the backslashes in the Windows path are not treated as escapes.
os.environ["SPARK_HOME"] = r"F:\App\spark-2.3.1-bin-hadoop2.6"

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("Spark to Hive") \
        .master("local[4]") \
        .config("hive.metastore.uris", "thrift://bigdata01:9083,thrift://bigdata02:9083") \
        .enableHiveSupport() \
        .getOrCreate()

    # Replace db.table with an existing Hive table.
    df = spark.read.table("db.table")
    df.show(truncate=False)

    spark.stop()
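The PySpark version differs from the Scala one in that PySpark depends on files from a Spark installation at runtime, so the SPARK_HOME environment variable must be set. Download the matching Spark release from the Spark website (for example, spark-2.3.3-bin-hadoop2.6.tgz), extract it to a directory of your choice, and set SPARK_HOME to the root of the extracted directory.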
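A quick way to catch a wrong SPARK_HOME before PySpark fails to start is to check that the directory contains the spark-submit launcher script. The sketch below is a heuristic only; `spark_submit_path` and `is_valid_spark_home` are hypothetical helpers, and they assume the standard layout of a Spark binary distribution, with the launcher under bin.

```python
import os
import sys


def spark_submit_path(spark_home: str) -> str:
    """Return the launcher script expected under spark_home/bin."""
    # Spark distributions ship spark-submit.cmd for Windows and spark-submit otherwise.
    script = "spark-submit.cmd" if sys.platform.startswith("win") else "spark-submit"
    return os.path.join(spark_home, "bin", script)


def is_valid_spark_home(spark_home: str) -> bool:
    """Heuristic: the directory contains the spark-submit launcher script."""
    return os.path.isfile(spark_submit_path(spark_home))
```

For example, `is_valid_spark_home(os.environ.get("SPARK_HOME", ""))` returns False when the variable is unset or points at the parent of the extracted directory instead of its root.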