Up front: the root cause was that I had downloaded the "spark without hadoop" build; the "spark with hadoop" build should be used instead!!!
----------------------------------------------
Reference steps: https://docs.nebula-graph.com.cn/3.1.0/nebula-algorithm/#_2
Error:
2022-05-31 17:40:16,459 WARN [main] util.Utils (Logging.scala:logWarning(66)) - Your hostname, bonelee-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
2022-05-31 17:40:16,461 WARN [main] util.Utils (Logging.scala:logWarning(66)) - Set SPARK_LOCAL_IP if you need to bind to another address
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:464)
at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)
at org.apache.spark.deploy.SparkSubmit$$anonfun$2.apply(SparkSubmit.scala:323)
at org.apache.spark.deploy.SparkSubmit$$anonfun$2.apply(SparkSubmit.scala:323)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:323)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:784)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Submit command:
spark-submit --master "local" \
--conf spark.app.name="g1" \
--conf spark.executor.extraLibraryPath=/home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
--conf spark.executor.extraClassPath=/home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
--driver-class-path /home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
--driver-library-path /home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
--class com.vesoft.nebula.algorithm.Main nebula-algorithm/target/nebula-algorithm-3.0.0.jar -p /home/bonelee/Desktop/nebula-algorithm/application.conf
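Before the real fix was found (switching to the spark-with-hadoop build, see the end of this post), another thing commonly tried for this kind of Guava conflict is Spark's experimental "user classpath first" switches, which make Spark prefer user-supplied jars over its own and Hadoop's. Only a sketch, not verified here:
# sketch: ask Spark to prefer the user-supplied guava (spark.*.userClassPathFirst are experimental Spark options)
spark-submit --master "local" \
--conf spark.app.name="g1" \
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--jars /home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
--class com.vesoft.nebula.algorithm.Main nebula-algorithm/target/nebula-algorithm-3.0.0.jar -p /home/bonelee/Desktop/nebula-algorithm/application.conf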
Versions:
Scala 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_333).
Spark: 2.4.8-bin-without-hadoop
Hadoop: 3.2.3
Nebula: 3.0.0
nebula-algorithm: 3.0.0
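For context on why this particular combination breaks (my reading of the two stack traces in this post, so treat it as a hedged explanation): the "without hadoop" Spark build takes its Hadoop classes from the external Hadoop 3.2.3 install via SPARK_DIST_CLASSPATH, and Hadoop 3.2.x bundles a much newer Guava (27.x) than the roughly Guava 14 that Spark 2.4 and the Nebula connector were built against. Hadoop 3 needs the newer checkArgument(boolean, String, Object) overload, while the Nebula connector needs the old HostAndPort.getHostText(), so no single Guava on the classpath satisfies both; the bundled-Hadoop 2.7 build avoids the conflict. A quick way to see which Guava jars are actually in play (typical directory layout; SPARK_HOME/HADOOP_HOME and these paths are assumptions, adjust to your install):
# guava bundled with Spark itself (Spark 2.4 ships roughly guava 14)
ls $SPARK_HOME/jars | grep -i guava
# guava bundled with Hadoop (Hadoop 3.2 ships roughly guava 27)
ls $HADOOP_HOME/share/hadoop/common/lib | grep -i guava
# the "without hadoop" build pulls the Hadoop jars in through this variable
echo $SPARK_DIST_CLASSPATH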
application.conf:
{
# Spark relation config
spark: {
app: {
name: LPA
# spark.app.partitionNum
partitionNum:100
}
master:local
}
data: {
# data source. one of: nebula, csv, json
source: nebula
# data sink: the algorithm result will be written into this sink. one of: nebula, csv, text
sink: csv
# whether your algorithm needs a weight
hasWeight: false
}
# Nebula Graph relation config
nebula: {
# algo's data source from Nebula. If data.source is nebula, this nebula.read config takes effect.
read: {
# Nebula metad server address; multiple addresses are separated by commas
metaAddress: "127.0.0.1:9559"
# Nebula space
space: basketballplayer
# Nebula edge types; multiple labels mean that data from multiple edge types will be unioned together
labels: ["serve"]
# Nebula edge property name for each edge type; this property is used as the weight column for the algorithm.
# Make sure the weightCols correspond to the labels.
weightCols: ["start_year"]
}
# algo result sink into Nebula. If data.sink is nebula, this nebula.write config takes effect.
write:{
# Nebula graphd server address; multiple addresses are separated by commas
graphAddress: "127.0.0.1:9669"
# Nebula metad server address; multiple addresses are separated by commas
metaAddress: "127.0.0.1:9559"
user:root
pswd:nebula
# Nebula space name
space:nb
# Nebula tag name; the algorithm result will be written into this tag
tag:pagerank
# whether the algorithm result is inserted as a new tag or updates the original tag. type: insert/update
type:insert
}
}
local: {
# algo's data source from a local file. If data.source is csv or json, this local.read config takes effect.
read:{
filePath: "file:///tmp/algo_edge.csv"
srcId:"src"
# dstId column
dstId:"dst"
# weight column
weight: "weight"
# if csv file has header
header: true
# csv file's delimiter
delimiter:","
}
# algo result sink into a local file. If data.sink is csv or text, this local.write config takes effect.
write:{
resultPath:/tmp/count
}
}
algorithm: {
# the algorithm that you are going to execute, pick one from [pagerank, louvain, connectedcomponent,
# labelpropagation, shortestpaths, degreestatic, kcore, stronglyconnectedcomponent, trianglecount,
# betweenness, graphtriangleCount, clusteringcoefficient, bfs, hanp, closeness, jaccard, node2vec]
executeAlgo: graphtrianglecount
# PageRank parameter
pagerank: {
maxIter: 10
resetProb: 0.15 # default 0.15
}
# Louvain parameter
louvain: {
maxIter: 20
internalIter: 10
tol: 0.5
}
# connected component parameter.
connectedcomponent: {
maxIter: 20
}
# LabelPropagation parameter
labelpropagation: {
maxIter: 20
}
# ShortestPaths parameter
shortestpaths: {
# several vertices to compute the shortest path to all vertices.
landmarks: "1"
}
# Vertex degree statistics parameter
degreestatic: {}
# KCore parameter
kcore:{
maxIter:10
degree:1
}
# Trianglecount parameter
trianglecount:{}
# graphTriangleCount parameter
graphtrianglecount:{}
# Betweenness centrality parameter. maxIter is the maximum number of iterations.
betweenness:{
maxIter:5
}
# Clustering Coefficient parameter. The type parameter has two choices, local or global.
# local computes the clustering coefficient for each vertex and prints the average coefficient for the graph.
# global just computes the graph's clustering coefficient.
clusteringcoefficient:{
type: local
}
# ClosenessAlgo parameter
closeness:{}
# BFS parameter
bfs:{
maxIter:5
root:"10"
}
# HanpAlgo parameter
hanp:{
hopAttenuation:0.1
maxIter:10
preference:1.0
}
#Node2vecAlgo parameter
node2vec:{
maxIter: 10,
lr: 0.025,
dataNumPartition: 10,
modelNumPartition: 10,
dim: 10,
window: 3,
walkLength: 5,
numWalks: 3,
p: 1.0,
q: 1.0,
directed: false,
degree: 30,
embSeparate: ",",
modelPath: "hdfs://127.0.0.1:9000/model"
}
# JaccardAlgo parameter
jaccard:{
tol: 1.0
}
}
}
If the data source above is changed from nebula to csv, the job runs without error!!!
------------------------------------------------------------------------------------------------------------------------------
Error when submitting the Spark job for nebula-algorithm; here the read data source is nebula!!!
Exception in thread "main" java.lang.NoSuchMethodError: 'java.lang.String com.google.common.net.HostAndPort.getHostText()'
at com.vesoft.nebula.connector.NebulaOptions$$anonfun$getMetaAddress$1.apply(NebulaOptions.scala:186)
at com.vesoft.nebula.connector.NebulaOptions$$anonfun$getMetaAddress$1.apply(NebulaOptions.scala:183)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at com.vesoft.nebula.connector.NebulaOptions.getMetaAddress(NebulaOptions.scala:183)
at com.vesoft.nebula.connector.reader.NebulaSourceReader.getSchema(NebulaSourceReader.scala:45)
at com.vesoft.nebula.connector.reader.NebulaSourceReader.readSchema(NebulaSourceReader.scala:30)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:175)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at com.vesoft.nebula.connector.connector.package$NebulaDataFrameReader.loadEdgesToDF(package.scala:172)
at com.vesoft.nebula.algorithm.reader.NebulaReader$$anonfun$read$1.apply$mcVI$sp(DataReader.scala:52)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.vesoft.nebula.algorithm.reader.NebulaReader.read(DataReader.scala:38)
at com.vesoft.nebula.algorithm.Main$.createDataSource(Main.scala:118)
at com.vesoft.nebula.algorithm.Main$.main(Main.scala:84)
at com.vesoft.nebula.algorithm.Main.main(Main.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:855)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Submit command (the commented-out guava options below were also tried, without success):
/home/bonelee/spark-2.4.8-bin-without-hadoop/bin/spark-submit --master "local" \
--class com.vesoft.nebula.algorithm.Main nebula-algorithm/target/nebula-algorithm-3.0.0.jar -p /home/bonelee/Desktop/nebula-algorithm/application.conf
#--driver-class-path /home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
#--driver-library-path /home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
#--conf spark.executor.extraLibraryPath=/home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
#--conf spark.executor.extraClassPath=/home/bonelee/Desktop/nebula-algorithm/guava-14.0.jar \
Even using the method described in the referenced article and putting guava-14.0.jar on the classpath, it still fails with this error!
Really frustrating; I spent the whole afternoon on it and did not get it working...
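If you want to confirm the mismatch yourself, javap can show which of the two troublesome methods a given Guava jar actually contains. A diagnostic sketch only; the jar file names here are whichever copies you have locally (the Hadoop 3.2-bundled one is typically guava-27.0-jre.jar):
# HostAndPort.getHostText() exists in older Guava but was removed in newer releases;
# the Nebula connector calls it, so whichever Guava wins on the classpath must still have it
javap -classpath guava-14.0.jar com.google.common.net.HostAndPort | grep getHost
# the non-varargs checkArgument(boolean, String, Object) that Hadoop 3 calls exists only in newer Guava
javap -classpath guava-27.0-jre.jar com.google.common.base.Preconditions | grep checkArgument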
The way to work around the problem is to modify the data source in the application.conf file:
data: {
# data source. one of: nebula, csv, json
source: nebula
# data sink: the algorithm result will be written into this sink. one of: nebula, csv, text
sink: csv
# whether your algorithm needs a weight
hasWeight: false
}
Change the nebula above to csv, and also configure the local section:
local: {
# algo's data source from a local file. If data.source is csv or json, this local.read config takes effect.
read:{
filePath: "file:///tmp/algo_edge.csv"
srcId:"src"
# dstId column
dstId:"dst"
# weight column
weight: "weight"
# if csv file has header
header: true
# csv file's delimiter
delimiter:","
}
# algo result sink into a local file. If data.sink is csv or text, this local.write config takes effect.
write:{
resultPath:/tmp/count
}
}
Contents of algo_edge.csv:
src,dst,weight
1,1,5.0
1,2,1.0
1,3,5.0
1,4,1.0
2,1,5.0
2,2,1.0
2,3,5.0
2,4,1.0
3,1,1.0
3,2,5.0
3,3,1.0
3,4,5.0
4,1,1.0
4,2,5.0
4,3,1.0
4,4,5.0
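With the source switched to csv and this file in place, the job is resubmitted exactly as before (a sketch, reusing the jar and config paths from the earlier command):
/home/bonelee/spark-2.4.8-bin-without-hadoop/bin/spark-submit --master "local" \
--class com.vesoft.nebula.algorithm.Main nebula-algorithm/target/nebula-algorithm-3.0.0.jar -p /home/bonelee/Desktop/nebula-algorithm/application.conf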
This time the result is produced normally!
bonelee@bonelee-VirtualBox:~/Desktop/nebula-algorithm-2.6$ cat /tmp/count/part-00000-7ae75bc1-80e7-4469-9dda-268a8036db09-c000.csv
count
4
----------------------------------------------------
2022-06-13 solution: I finally got it running end to end, on Windows with Spark 2.4.7. Unbelievable...
Key steps:
1. Modify the config files on Linux
In etc/, change 127.0.0.1 in nebula-graphd.conf, nebula-storaged.conf, and nebula-metad.conf to the externally reachable address; mine is 8.35.24.181:
########## networking ##########
# Comma separated Meta Server Addresses
--meta_server_addrs=8.35.24.181:9559
# Local IP used to identify the nebula-graphd process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=8.35.24.181
Then remember to ADD HOSTS for the storaged address (e.g. ADD HOSTS 8.35.24.181:9779), and check that it shows up online:
(root@nebula) [algo_edge]> show hosts
+---------------+------+-----------+-----------+--------------+----------------------+-----------------------------------+---------+
| Host | Port | HTTP port | Status | Leader count | Leader distribution | Partition distribution | Version |
+---------------+------+-----------+-----------+--------------+----------------------+-----------------------------------+---------+
| "127.0.0.1" | 9779 | 19669 | "OFFLINE" | 0 | "No valid partition" | "basketballplayer:10, cadets:100" | "3.1.0" |
| "8.35.24.181" | 9779 | 19669 | "ONLINE" | 10 | "algo_edge:10" | "algo_edge:10" | "3.1.0" |
+---------------+------+-----------+-----------+--------------+----------------------+-----------------------------------+---------+
Got 2 rows (time spent 3122/3893 us)
Mon, 13 Jun 2022 18:41:25 CST
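Before moving on, it is worth confirming that the services are actually reachable from the machine that will run Spark. A quick sketch using the default Nebula ports shown above (standard nc; on Windows, Test-NetConnection does the same job):
# run from the Spark client machine against the external address
nc -zv 8.35.24.181 9669   # graphd
nc -zv 8.35.24.181 9559   # metad
nc -zv 8.35.24.181 9779   # storaged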
2. Insert data into the graph database:
drop space algo_edge;
create space algo_edge(partition_num=10,replica_factor=1,vid_type=INT64);
:sleep 20
use algo_edge;
create tag player(id int);
create edge serve(serve_year int);
:sleep 20
insert vertex player(id) values 1:(1);
insert vertex player(id) values 2:(2);
insert vertex player(id) values 3:(3);
insert vertex player(id) values 4:(4);
insert edge serve(serve_year) values 1->1:(5);
insert edge serve(serve_year) values 1->2:(1);
insert edge serve(serve_year) values 1->3:(5);
insert edge serve(serve_year) values 1->4:(1);
insert edge serve(serve_year) values 2->1:(5);
insert edge serve(serve_year) values 2->2:(1);
insert edge serve(serve_year) values 2->3:(5);
insert edge serve(serve_year) values 2->4:(1);
insert edge serve(serve_year) values 3->1:(1);
insert edge serve(serve_year) values 3->2:(5);
insert edge serve(serve_year) values 3->3:(1);
insert edge serve(serve_year) values 3->4:(5);
insert edge serve(serve_year) values 4->1:(1);
insert edge serve(serve_year) values 4->2:(5);
insert edge serve(serve_year) values 4->3:(1);
insert edge serve(serve_year) values 4->4:(5);
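These statements can be pasted into the console one by one, or, if saved to a file, fed to nebula-console in one go. A sketch assuming a hypothetical file name algo_edge.ngql and the nebula-console 3.x flags:
# execute the statement file against the remote graphd (the file name is hypothetical)
nebula-console -addr 8.35.24.181 -port 9669 -u root -p nebula -f algo_edge.ngql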
3. Modify app.conf, especially the Nebula connection part:
{
# Spark relation config
spark: {
app: {
name: LPA
# spark.app.partitionNum
partitionNum:100
}
master:local
}
data: {
# data source. one of: nebula, csv, json
source: nebula
# data sink: the algorithm result will be written into this sink. one of: nebula, csv, text
sink: csv
# whether your algorithm needs a weight
hasWeight: false
}
# Nebula Graph relation config
nebula: {
# algo's data source from Nebula. If data.source is nebula, this nebula.read config takes effect.
read: {
# Nebula metad server address; multiple addresses are separated by commas
metaAddress: "8.35.24.181:9559"
# Nebula space
space: algo_edge
# Nebula edge types; multiple labels mean that data from multiple edge types will be unioned together
labels: ["serve"]
# Nebula edge property name for each edge type; this property is used as the weight column for the algorithm.
# Make sure the weightCols correspond to the labels.
weightCols: ["serve_year"]
}
# algo result sink into Nebula. If data.sink is nebula, this nebula.write config takes effect.
write:{
# Nebula graphd server address; multiple addresses are separated by commas
graphAddress: "127.0.0.1:9669"
# Nebula metad server address; multiple addresses are separated by commas
metaAddress: "127.0.0.1:9559"
user:root
pswd:nebula
# Nebula space name
space:nb
# Nebula tag name; the algorithm result will be written into this tag
tag:pagerank
# whether the algorithm result is inserted as a new tag or updates the original tag. type: insert/update
type:insert
}
}
local: {
# algo's data source from a local file. If data.source is csv or json, this local.read config takes effect.
read:{
filePath: "file:///D:/tmp/algo_edge.csv"
srcId:"src"
# dstId column
dstId:"dst"
# weight column
weight: "weight"
# if csv file has header
header: true
# csv file's delimiter
delimiter:","
}
# algo result sink into a local file. If data.sink is csv or text, this local.write config takes effect.
write:{
resultPath: "file:///D:/tmp/result0613002"
}
}
algorithm: {
# the algorithm that you are going to execute, pick one from [pagerank, louvain, connectedcomponent,
# labelpropagation, shortestpaths, degreestatic, kcore, stronglyconnectedcomponent, trianglecount,
# betweenness, graphtriangleCount, clusteringcoefficient, bfs, hanp, closeness, jaccard, node2vec]
executeAlgo: graphtrianglecount
# PageRank parameter
pagerank: {
maxIter: 10
resetProb: 0.15 # default 0.15
}
# Louvain parameter
louvain: {
maxIter: 20
internalIter: 10
tol: 0.5
}
# connected component parameter.
connectedcomponent: {
maxIter: 20
}
# LabelPropagation parameter
labelpropagation: {
maxIter: 20
}
# ShortestPaths parameter
shortestpaths: {
# several vertices to compute the shortest path to all vertices.
landmarks: "1"
}
# Vertex degree statistics parameter
degreestatic: {}
# KCore parameter
kcore:{
maxIter:10
degree:1
}
# Trianglecount parameter
trianglecount:{}
# graphTriangleCount parameter
graphtrianglecount:{}
# Betweenness centrality parameter. maxIter is the maximum number of iterations.
betweenness:{
maxIter:5
}
# Clustering Coefficient parameter. The type parameter has two choices, local or global.
# local computes the clustering coefficient for each vertex and prints the average coefficient for the graph.
# global just computes the graph's clustering coefficient.
clusteringcoefficient:{
type: local
}
# ClosenessAlgo parameter
closeness:{}
# BFS parameter
bfs:{
maxIter:5
root:"10"
}
# HanpAlgo parameter
hanp:{
hopAttenuation:0.1
maxIter:10
preference:1.0
}
#Node2vecAlgo parameter
node2vec:{
maxIter: 10,
lr: 0.025,
dataNumPartition: 10,
modelNumPartition: 10,
dim: 10,
window: 3,
walkLength: 5,
numWalks: 3,
p: 1.0,
q: 1.0,
directed: false,
degree: 30,
embSeparate: ",",
modelPath: "hdfs://127.0.0.1:9000/model"
}
# JaccardAlgo parameter
jaccard:{
tol: 1.0
}
}
}
4. Then run Spark; it connects to the Nebula data source and runs successfully!!!
spark-submit --master "local" --class com.vesoft.nebula.algorithm.Main nebula-algorithm-3.0.0.jar -p c.conf
Note: I installed the build that bundles Hadoop; the spark-submit used is:
D:\app\spark-2.4.7-bin-hadoop2.7\bin\spark-submit