Add the third-party jar dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.11</artifactId>
    <version>1.6.0</version>
</dependency>
Code:
First, let's understand the Loan Pattern.
The term "Loan Pattern" comes from Scala. As I understand it, the pattern is meant for working with resource-intensive objects.
The reasoning behind it: since the resource is held by a single provider object, client code should not hold on to the resource indefinitely; it should borrow it from the provider when needed and return it as soon as the work is done.
Moreover, functions in Scala are themselves objects and can be passed around like any other argument, which makes the loan pattern even more natural: the client borrows the resource, and what it then does with it to accomplish its task is entirely up to the client. It is like taking out a loan from a bank: how the money is spent is decided by, and known only to, the borrower.
Typical resources of this kind are database connections, IO handles, and so on, which must be released as soon as we are done with them. Also, once the resource has been used it is reclaimed automatically; we do not have to worry about the release step ourselves.
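To make the idea concrete outside of Spark first, here is a minimal sketch of a loan function for a plain JDBC connection. The withConnection helper, the in-memory H2 URL and the query are hypothetical and only for illustration (the H2 driver would have to be on the classpath); they are not part of the Spark code that follows.

import java.sql.{Connection, DriverManager}

object LoanPatternSketch {
  // The loan function owns the resource's lifecycle: it acquires the
  // connection, lends it to the `work` function, and always closes it,
  // even if `work` throws.
  def withConnection[T](url: String)(work: Connection => T): T = {
    val conn = DriverManager.getConnection(url)
    try {
      work(conn) // the client decides what to do with the borrowed resource
    } finally {
      conn.close() // the provider, not the client, releases the resource
    }
  }

  def main(args: Array[String]): Unit = {
    // Client code only describes the work; it never manages the connection itself.
    withConnection("jdbc:h2:mem:demo") { conn =>
      val stmt = conn.createStatement()
      stmt.execute("SELECT 1")
    }
  }
}

Note that withConnection is itself written with two parameter lists; this is the same currying style that sparkOperation uses below.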
Below, Spark Streaming reads words from a Kafka topic and performs a streaming word count. The sparkOperation method lends out a SparkContext through its operation parameter; what exactly is done with that SparkContext is defined by the user (here, in processData, which is passed in as operation), and the resource is released automatically once the operation finishes.
The method definition uses Scala's currying technique, which is why sparkOperation has two parameter lists (two pairs of parentheses): sparkOperation(args: Array[String])(operation: (SparkContext, Array[String]) => Unit). The second parameter, operation, is where the caller passes in the concrete work to perform (i.e., how to use the resource).
The client code is shown below; what it does is entirely up to the user:
/**
 * The user function in the loan pattern.
 * For a Spark application it has to do three things:
 *   1. read the data
 *   2. process the data
 *   3. output the results
 * @param sc
 * @param args
 */
def processData(sc: SparkContext, args: Array[String]): Unit = {
  if (args.length < 4) {
    System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
    System.exit(1)
  }
  val Array(zkQuorum, group, topics, numThreads) = args
  /**
   * step1. input data ==>> DStream
   * read the data
   */
  /**
   * step2. process data ==>> DStream#Transformation
   * transform the data
   */
  /**
   * step3. output data ==>> RDD#Output
   * output the results
   */
}
Full code:
package ezr.bigdata.spark.hive.streaming

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.DateTime

/**
 * The loan pattern in Scala, used here as a programming template for Spark:
 * 1. the loan function, usually written before main
 * 2. the user function, usually written after main
 *
 * 2019/7/30.
 */
object KafKaStreamingWordCount {

  def sparkOperation(args: Array[String])(operation: (SparkContext, Array[String]) => Unit): Unit = {
    /**
     * Read the configuration files.
     * The HDFS configuration is read here as well, so for a local test run
     * in IDEA, put hdfs-site.xml and core-site.xml into the resources
     * folder first; only then can the job be executed locally.
     */
    val sparkConf = new SparkConf()
      .setAppName("KafKaStreamingWordCount")
      .setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    /**
     * Call the user function.
     * We do not call processData(sc, args) directly here; instead the user
     * function is passed into the loan function as the curried `operation`
     * parameter.
     */
    try {
      operation(sc, args)
    } finally {
      // The SparkContext is a resource, and resources must be closed.
      sc.stop()
    }
  }

  /**
   * Spark application entry point,
   * i.e. what we call the Driver Program.
   * @param args
   */
  def main(args: Array[String]): Unit = {
    sparkOperation(args)(processData)
  }

  /**
   * The user function in the loan pattern.
   * For a Spark application it has to do three things:
   *   1. read the data
   *   2. process the data
   *   3. output the results
   * @param sc
   * @param args
   */
  def processData(sc: SparkContext, args: Array[String]): Unit = {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads) = args
    /**
     * step1. input data ==>> DStream
     * read the data
     */
    // Create the StreamingContext; Seconds(10) is the batch interval into which the stream is divided.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("checkpoint")
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines: DStream[String] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    /**
     * step2. process data ==>> DStream#Transformation
     * transform the data
     */
    val words = lines.flatMap(_.split(" "))
    // Count words over a 20-second window sliding every 10 seconds; the inverse
    // function (_ - _) allows the window to be updated incrementally.
    val wordCounts: DStream[(String, Long)] = words.map(x => (x, 1L)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(20), Seconds(10), 2)
    /**
     * step3. output data ==>> RDD#Output
     * output the results
     */
    wordCounts.print()
    // Start the streaming application
    ssc.start()
    // Wait for the application to terminate
    ssc.awaitTermination()
  }
}
Running it directly throws an exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$.processData(KafKaStreamingWordCount.scala:81)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$$anonfun$main$1.apply(KafKaStreamingWordCount.scala:53)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$$anonfun$main$1.apply(KafKaStreamingWordCount.scala:53)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$.sparkOperation(KafKaStreamingWordCount.scala:41)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$.main(KafKaStreamingWordCount.scala:53)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount.main(KafKaStreamingWordCount.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more
This happens because our Spark version is 2.0.0, and org/apache/spark/Logging was removed after Spark 1.5.2, while spark-streaming-kafka 1.6.0 still uses Logging; hence the exception above. Knowing the cause, there are two fixes, both of which I have tried and both of which work, as follows.
Solution 1:
Create the package org.apache.spark in your own project,
then put the Scala trait Logging into that package.
Code of trait Logging:
package org.apache.spark

import org.apache.log4j.{LogManager, PropertyConfigurator}
import org.slf4j.{Logger, LoggerFactory}
import org.slf4j.impl.StaticLoggerBinder
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.util.Utils

/**
 * @author liuchangfu@easyretailpro.com
 * 2019/7/30.
 */
/**
 * :: DeveloperApi ::
 * Utility trait for classes that want to log data. Creates a SLF4J logger for the class and allows
 * logging messages at different levels using methods that only evaluate parameters lazily if the
 * log level is enabled.
 *
 * NOTE: DO NOT USE this class outside of Spark. It is intended as an internal utility.
 * This will likely be changed or removed in future releases.
 */
@DeveloperApi
trait Logging {
  // Make the log field transient so that objects with Logging can
  // be serialized and used on another machine
  @transient private var log_ : Logger = null

  // Method to get the logger name for this object
  protected def logName = {
    // Ignore trailing $'s in the class names for Scala objects
    this.getClass.getName.stripSuffix("$")
  }

  // Method to get or create the logger for this object
  protected def log: Logger = {
    if (log_ == null) {
      initializeIfNecessary()
      log_ = LoggerFactory.getLogger(logName)
    }
    log_
  }

  // Log methods that take only a String
  protected def logInfo(msg: => String) {
    if (log.isInfoEnabled) log.info(msg)
  }

  protected def logDebug(msg: => String) {
    if (log.isDebugEnabled) log.debug(msg)
  }

  protected def logTrace(msg: => String) {
    if (log.isTraceEnabled) log.trace(msg)
  }

  protected def logWarning(msg: => String) {
    if (log.isWarnEnabled) log.warn(msg)
  }

  protected def logError(msg: => String) {
    if (log.isErrorEnabled) log.error(msg)
  }

  // Log methods that take Throwables (Exceptions/Errors) too
  protected def logInfo(msg: => String, throwable: Throwable) {
    if (log.isInfoEnabled) log.info(msg, throwable)
  }

  protected def logDebug(msg: => String, throwable: Throwable) {
    if (log.isDebugEnabled) log.debug(msg, throwable)
  }

  protected def logTrace(msg: => String, throwable: Throwable) {
    if (log.isTraceEnabled) log.trace(msg, throwable)
  }

  protected def logWarning(msg: => String, throwable: Throwable) {
    if (log.isWarnEnabled) log.warn(msg, throwable)
  }

  protected def logError(msg: => String, throwable: Throwable) {
    if (log.isErrorEnabled) log.error(msg, throwable)
  }

  protected def isTraceEnabled(): Boolean = {
    log.isTraceEnabled
  }

  private def initializeIfNecessary() {
    if (!Logging.initialized) {
      Logging.initLock.synchronized {
        if (!Logging.initialized) {
          initializeLogging()
        }
      }
    }
  }

  private def initializeLogging() {
    // Don't use a logger in here, as this is itself occurring during initialization of a logger
    // If Log4j 1.2 is being used, but is not initialized, load a default properties file
    val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
    // This distinguishes the log4j 1.2 binding, currently
    // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, currently
    // org.apache.logging.slf4j.Log4jLoggerFactory
    val usingLog4j12 = "org.slf4j.impl.Log4jLoggerFactory".equals(binderClass)

    lazy val isInInterpreter: Boolean = {
      try {
        val interpClass = classForName("org.apache.spark.repl.Main")
        interpClass.getMethod("interp").invoke(null) != null
      } catch {
        case _: ClassNotFoundException => false
      }
    }

    def classForName(className: String): Class[_] = {
      Class.forName(className, true, getContextOrSparkClassLoader)
      // scalastyle:on classforname
    }

    def getContextOrSparkClassLoader: ClassLoader =
      Option(Thread.currentThread().getContextClassLoader).getOrElse(getSparkClassLoader)

    def getSparkClassLoader: ClassLoader = getClass.getClassLoader

    if (usingLog4j12) {
      val log4j12Initialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
      if (!log4j12Initialized) {
        // scalastyle:off println
        if (isInInterpreter) {
          val replDefaultLogProps = "org/apache/spark/log4j-defaults-repl.properties"
          Option(Utils.getSparkClassLoader.getResource(replDefaultLogProps)) match {
            case Some(url) =>
              PropertyConfigurator.configure(url)
              System.err.println(s"Using Spark's repl log4j profile: $replDefaultLogProps")
              System.err.println("To adjust logging level use sc.setLogLevel(\"INFO\")")
            case None =>
              System.err.println(s"Spark was unable to load $replDefaultLogProps")
          }
        } else {
          val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
          Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
            case Some(url) =>
              PropertyConfigurator.configure(url)
              System.err.println(s"Using Spark's default log4j profile: $defaultLogProps")
            case None =>
              System.err.println(s"Spark was unable to load $defaultLogProps")
          }
        }
        // scalastyle:on println
      }
    }
    Logging.initialized = true

    // Force a call into slf4j to initialize it. Avoids this happening from multiple threads
    // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
    log
  }
}

private object Logging {
  @volatile private var initialized = false
  val initLock = new Object()
  try {
    // We use reflection here to handle the case where users remove the
    // slf4j-to-jul bridge order to route their logs to JUL.
    val bridgeClass = Utils.classForName("org.slf4j.bridge.SLF4JBridgeHandler")
    bridgeClass.getMethod("removeHandlersForRootLogger").invoke(null)
    val installed = bridgeClass.getMethod("isInstalled").invoke(null).asInstanceOf[Boolean]
    if (!installed) {
      bridgeClass.getMethod("install").invoke(null)
    }
  } catch {
    case e: ClassNotFoundException => // can't log anything yet so just fail silently
  }
}
Solution 2:
Change the dependency to:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
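With either fix in place, the job can be run locally. It expects four arguments in the order <zkQuorum> <group> <topics> <numThreads>. For a quick test from the IDE, the arguments can also be supplied programmatically; the ZooKeeper address, consumer group and topic below are placeholder values, not taken from the original setup:

object RunKafKaStreamingWordCountLocally {
  def main(args: Array[String]): Unit = {
    // Placeholder arguments: <zkQuorum> <group> <topics> <numThreads>
    ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount.main(
      Array("localhost:2181", "wordcount-group", "wordcount-topic", "1")
    )
  }
}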