:Scala操作MongoDB(比较全)

 :使用用户名和密码进行连接。 

:Spark写入数据到mongDB

注意:casbah-core_2.10版本需要与scala版本保持一致。

本项目中:scala采用2.11,所以配置如下。

<dependency>
 <groupId>org.mongodb.spark</groupId>
 <artifactId>mongo-spark-connector_2.11</artifactId>
 <version>2.1.3</version>
 </dependency>

 <dependency>
     <groupId>org.mongodb</groupId>
     <artifactId>casbah-core_2.11</artifactId>
     <version>3.1.1</version></dependency>

mongodb和spark驱动 spark连接mongodb_spark

 

mongodb和spark驱动 spark连接mongodb_scala_02

2.spark shell链接数据库MongDB

一般默认的spark库中没有MongDB所需要的工具包,此时需要将对应的包打入jar包中,并且注意版本。

mongodb和spark驱动 spark连接mongodb_scala_03

如果数据库的版本打包错误,同样连接不上。 

mongodb和spark驱动 spark连接mongodb_scala_04

版本对应出错的例子1

Spark 2.1.0 - Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class

I am trying to use spark-submit with Spark 2.1.0 to read Avro files.

Maven dependencies:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.1.0</version>
</dependency>    
<dependency>
       <groupId>com.databricks</groupId>
       <artifactId>spark-avro_2.10</artifactId>
       <version>3.2.0</version>
</dependency>    
   <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>    
   <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
</dependency>

Getting the following exception:

Caused by: java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
        at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1$$anon$1.<init>(DefaultSource.scala:205)
        at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:205)
        at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:160)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:150)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

The source of the problem is mixing Scala 2.10 (spark-avro_2.10) and Scala 2.11 libraries. If you use Scala 2.11 it should be:

<dependency>
   <groupId>com.databricks</groupId>
   <artifactId>spark-avro_2.11</artifactId>
   <version>3.2.0</version>
</dependency>

版本对应错的例子2

mongodb和spark驱动 spark连接mongodb_scala_05

mongodb和spark驱动 spark连接mongodb_spark_06

红色代码一直报错。

经过排查,scala使用的是2.11版本,但是如下3个相关的jar包却使用了2.10版本,修改版本号后,重新运行上述代码,调试OK。

mongodb和spark驱动 spark连接mongodb_java_07

 连接OK的例子

mongodb和spark驱动 spark连接mongodb_mongodb和spark驱动_08

注意RDD数据的读取要使用get

mongodb和spark驱动 spark连接mongodb_spark_09

1.Scala远程连接MongoDB读取数据


 使用用户名和密码远程连接MongoDB数据库,用Java和Scala连接其实原理相同,都是JDBC,用MongoDB的连接驱动,只是语法上稍有区别而已,而在类、方法的调用上一模一样。
      在此,分享一下Scala连接MongoDB查看数据的Code,语法结构上稍作修改就可以用Java实现。
首先,下载连接驱动,添加到工程里,下载地址:mongo-java-driver-3.7.0.jar
 

使用用户名和密码远程连接MongoDB数据库,用Java和Scala连接其实原理相同,都是JDBC,用MongoDB的连接驱动,只是语法上稍有区别而已,而在类、方法的调用上一模一样。
      在此,分享一下Scala连接MongoDB查看数据的Code,语法结构上稍作修改就可以用Java实现。
首先,下载连接驱动,添加到工程里,下载地址:mongo-java-driver-3.7.0.jar

import java.io.{File, PrintWriter}
 
import com.mongodb.{MongoClient, ServerAddress}
import com.mongodb.client.{FindIterable, MongoCollection, MongoCursor, MongoDatabase}
import org.bson.Document
import com.mongodb.MongoCredential
import java.util.ArrayList
 
 
 
object MondoDBtoES {
  def main(args: Array[String]): Unit = {
    //初始化和连接
    try {
      //两个变量分别为服务器地址和端口
      val serverAddress:ServerAddress = new ServerAddress("XX.XXX.XX.XX",27017)
      val adds = new ArrayList[ServerAddress]()
      adds.add(serverAddress)
      //用户名,连接的数据库名,用户密码
      val credential = MongoCredential.createCredential("user",  "test","passwd".toCharArray)
      val credentials = new ArrayList[MongoCredential]()
      credentials.add(credential)
      //通过验证连接到MongoDB客户端
      val mongoClient:MongoClient = new MongoClient(adds,credentials)
      //连接数据库
      val mongoDatabase:MongoDatabase = mongoClient.getDatabase("test")
      println("Connect MongoDB Successfully!!!")
 
 
      val collection:MongoCollection[Document] = mongoDatabase.getCollection("price_discount")
      println("已经选中集合price_discount")
      printf(collection.count().toString)
      //检索所有文档
      //获取迭代器
      val findInterable:FindIterable[Document] = collection.find()
      //获取游标
      val mongoCursor:MongoCursor[Document] = findInterable.iterator()
      //开始遍历
      while (mongoCursor.hasNext){
         println(mongoCursor.next())
      }
 
 
    } catch {
      case e:Exception=>
        println(e.getClass.getName + ": " + e.getMessage)
    }
 
  }
}

 2.方法2 直接在config接口配置,如下对应没有密码的情况。

 

mongodb和spark驱动 spark连接mongodb_scala_10

 有密码的情况下,直接要按照如下格式更改URI即可。

mongodb和spark驱动 spark连接mongodb_java_11

 mongo-spark-读取不同的库数据和写入不同的库中

参考:

package com.example.app

 import com.mongodb.spark.config.{ReadConfig, WriteConfig}
 import com.mongodb.spark.sql._

object App {


 def main(args: Array[String]): Unit = {

    val MongoUri1 = args(0).toString
    val MongoUri2 = args(1).toString
    val SparkMasterUri= args(2).toString

     def makeMongoURI(uri:String,database:String,collection:String) = (s"${uri}/${database}.${collection}")

   val mongoURI1 = s"mongodb://${MongoUri1}:27017"
   val mongoURI2 = s"mongodb://${MongoUri2}:27017"

   val CONFdb1 = makeMongoURI(s"${mongoURI1}","MyColletion1,"df")
   val CONFdb2 = makeMongoURI(s"${mongoURI2}","MyColletion2,"df")

   val WRITEdb1: WriteConfig =  WriteConfig(scala.collection.immutable.Map("uri"->CONFdb1))
   val READdb1: ReadConfig = ReadConfig(Map("uri" -> CONFdb1))

   val WRITEdb2: WriteConfig =  WriteConfig(scala.collection.immutable.Map("uri"->CONFdb2))
   val READdb2: ReadConfig = ReadConfig(Map("uri" -> CONFdb2))

   val spark = SparkSession
  .builder
  .appName("AppMongo")
  .config("spark.worker.cleanup.enabled", "true")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

   val df1 = spark.read.mongo(READdb1)
   val df2 = spark.read.mongo(READdb2)
   df1.write.mode("overwrite").mongo(WRITEdb1)
   df2.write.mode("overwrite").mongo(WRITEdb2)
 }
}