Add the third-party jar dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.11</artifactId>
    <version>1.6.0</version>
</dependency>
Code:
First, let's understand the Loan Pattern.
The term "Loan Pattern" comes from Scala. As I understand it, the pattern is meant for working with resource-intensive objects.
The reasoning behind it: since the resource is held by a single provider object, client code should not hold on to the resource indefinitely; it should borrow it from the provider when needed and return it as soon as the work is done.
Moreover, functions in Scala are themselves objects and can be passed around like any other argument, which makes the loan pattern even more natural: the client borrows the resource, and what it then does with it to accomplish its task is entirely up to the client. It is like taking out a loan from a bank: how the money is spent is decided by, and known only to, the borrower.
Typical resources of this kind are database connections, IO handles, and so on, which must be released as soon as we are done with them. Also, once the resource has been used it is reclaimed automatically; we do not have to worry about the release step ourselves.
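To make the idea concrete outside of Spark first, here is a minimal sketch of a loan function for a plain JDBC connection. The withConnection helper, the in-memory H2 URL and the query are hypothetical and only for illustration (the H2 driver would have to be on the classpath); they are not part of the Spark code that follows.

import java.sql.{Connection, DriverManager}

object LoanPatternSketch {
  // The loan function owns the resource's lifecycle: it acquires the
  // connection, lends it to the `work` function, and always closes it,
  // even if `work` throws.
  def withConnection[T](url: String)(work: Connection => T): T = {
    val conn = DriverManager.getConnection(url)
    try {
      work(conn) // the client decides what to do with the borrowed resource
    } finally {
      conn.close() // the provider, not the client, releases the resource
    }
  }

  def main(args: Array[String]): Unit = {
    // Client code only describes the work; it never manages the connection itself.
    withConnection("jdbc:h2:mem:demo") { conn =>
      val stmt = conn.createStatement()
      stmt.execute("SELECT 1")
    }
  }
}

Note that withConnection is itself written with two parameter lists; this is the same currying style that sparkOperation uses below.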
Below, Spark Streaming reads words from a Kafka topic and performs a streaming word count. The sparkOperation method lends out a SparkContext through its operation parameter; what exactly is done with that SparkContext is defined by the user (here, in processData, which is passed in as operation), and the resource is released automatically once the operation finishes.
The method definition uses Scala's currying technique, which is why sparkOperation has two parameter lists (two pairs of parentheses): sparkOperation(args: Array[String])(operation: (SparkContext, Array[String]) => Unit). The second parameter, operation, is where the caller passes in the concrete work to perform (i.e., how to use the resource).
The client code is shown below; what it does is entirely up to the user:
/**
 * The user function in the loan pattern.
 * For a Spark application it has to do three things:
 *   1. read the data
 *   2. process the data
 *   3. output the results
 * @param sc
 * @param args
 */
def processData(sc: SparkContext, args: Array[String]): Unit = {
  if (args.length < 4) {
    System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
    System.exit(1)
  }
  val Array(zkQuorum, group, topics, numThreads) = args
  /**
   * step1. input data ==>> DStream
   * read the data
   */
  /**
   * step2. process data ==>> DStream#Transformation
   * transform the data
   */
  /**
   * step3. output data ==>> RDD#Output
   * output the results
   */
}
Full code:
package ezr.bigdata.spark.hive.streaming

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.DateTime

/**
 * The loan pattern in Scala, used here as a programming template for Spark:
 * 1. the loan function, usually written before main
 * 2. the user function, usually written after main
 *
 * 2019/7/30.
 */
object KafKaStreamingWordCount {

  def sparkOperation(args: Array[String])(operation: (SparkContext, Array[String]) => Unit): Unit = {
    /**
     * Read the configuration files.
     * The HDFS configuration is read here as well, so for a local test run
     * in IDEA, put hdfs-site.xml and core-site.xml into the resources
     * folder first; only then can the job be executed locally.
     */
    val sparkConf = new SparkConf()
      .setAppName("KafKaStreamingWordCount")
      .setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    /**
     * Call the user function.
     * We do not call processData(sc, args) directly here; instead the user
     * function is passed into the loan function as the curried `operation`
     * parameter.
     */
    try {
      operation(sc, args)
    } finally {
      // The SparkContext is a resource, and resources must be closed.
      sc.stop()
    }
  }

  /**
   * Spark application entry point,
   * i.e. what we call the Driver Program.
   * @param args
   */
  def main(args: Array[String]): Unit = {
    sparkOperation(args)(processData)
  }

  /**
   * The user function in the loan pattern.
   * For a Spark application it has to do three things:
   *   1. read the data
   *   2. process the data
   *   3. output the results
   * @param sc
   * @param args
   */
  def processData(sc: SparkContext, args: Array[String]): Unit = {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads) = args
    /**
     * step1. input data ==>> DStream
     * read the data
     */
    // Create the StreamingContext; Seconds(10) is the batch interval into which the stream is divided.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("checkpoint")
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines: DStream[String] = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    /**
     * step2. process data ==>> DStream#Transformation
     * transform the data
     */
    val words = lines.flatMap(_.split(" "))
    // Count words over a 20-second window sliding every 10 seconds; the inverse
    // function (_ - _) allows the window to be updated incrementally.
    val wordCounts: DStream[(String, Long)] = words.map(x => (x, 1L)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(20), Seconds(10), 2)
    /**
     * step3. output data ==>> RDD#Output
     * output the results
     */
    wordCounts.print()
    // Start the streaming application
    ssc.start()
    // Wait for the application to terminate
    ssc.awaitTermination()
  }
}
Running it directly throws an exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:91)
at org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:66)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$.processData(KafKaStreamingWordCount.scala:81)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$$anonfun$main$1.apply(KafKaStreamingWordCount.scala:53)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$$anonfun$main$1.apply(KafKaStreamingWordCount.scala:53)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$.sparkOperation(KafKaStreamingWordCount.scala:41)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount$.main(KafKaStreamingWordCount.scala:53)
at ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount.main(KafKaStreamingWordCount.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more
This happens because our Spark version is 2.0.0, and org/apache/spark/Logging was removed after Spark 1.5.2, while spark-streaming-kafka 1.6.0 still uses Logging; hence the exception above. Knowing the cause, there are two fixes, both of which I have tried and both of which work, as follows.
Solution 1:
Create the package org.apache.spark in your own project,
then put the Scala trait Logging into that package.
Code of trait Logging:
package org.apache.spark

import org.apache.log4j.{LogManager, PropertyConfigurator}
import org.slf4j.{Logger, LoggerFactory}
import org.slf4j.impl.StaticLoggerBinder
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.util.Utils

/**
 * @author liuchangfu@easyretailpro.com
 * 2019/7/30.
 */
/**
 * :: DeveloperApi ::
 * Utility trait for classes that want to log data. Creates a SLF4J logger for the class and allows
 * logging messages at different levels using methods that only evaluate parameters lazily if the
 * log level is enabled.
 *
 * NOTE: DO NOT USE this class outside of Spark. It is intended as an internal utility.
 * This will likely be changed or removed in future releases.
 */
@DeveloperApi
trait Logging {
  // Make the log field transient so that objects with Logging can
  // be serialized and used on another machine
  @transient private var log_ : Logger = null

  // Method to get the logger name for this object
  protected def logName = {
    // Ignore trailing $'s in the class names for Scala objects
    this.getClass.getName.stripSuffix("$")
  }

  // Method to get or create the logger for this object
  protected def log: Logger = {
    if (log_ == null) {
      initializeIfNecessary()
      log_ = LoggerFactory.getLogger(logName)
    }
    log_
  }

  // Log methods that take only a String
  protected def logInfo(msg: => String) {
    if (log.isInfoEnabled) log.info(msg)
  }

  protected def logDebug(msg: => String) {
    if (log.isDebugEnabled) log.debug(msg)
  }

  protected def logTrace(msg: => String) {
    if (log.isTraceEnabled) log.trace(msg)
  }

  protected def logWarning(msg: => String) {
    if (log.isWarnEnabled) log.warn(msg)
  }

  protected def logError(msg: => String) {
    if (log.isErrorEnabled) log.error(msg)
  }

  // Log methods that take Throwables (Exceptions/Errors) too
  protected def logInfo(msg: => String, throwable: Throwable) {
    if (log.isInfoEnabled) log.info(msg, throwable)
  }

  protected def logDebug(msg: => String, throwable: Throwable) {
    if (log.isDebugEnabled) log.debug(msg, throwable)
  }

  protected def logTrace(msg: => String, throwable: Throwable) {
    if (log.isTraceEnabled) log.trace(msg, throwable)
  }

  protected def logWarning(msg: => String, throwable: Throwable) {
    if (log.isWarnEnabled) log.warn(msg, throwable)
  }

  protected def logError(msg: => String, throwable: Throwable) {
    if (log.isErrorEnabled) log.error(msg, throwable)
  }

  protected def isTraceEnabled(): Boolean = {
    log.isTraceEnabled
  }

  private def initializeIfNecessary() {
    if (!Logging.initialized) {
      Logging.initLock.synchronized {
        if (!Logging.initialized) {
          initializeLogging()
        }
      }
    }
  }

  private def initializeLogging() {
    // Don't use a logger in here, as this is itself occurring during initialization of a logger
    // If Log4j 1.2 is being used, but is not initialized, load a default properties file
    val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
    // This distinguishes the log4j 1.2 binding, currently
    // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, currently
    // org.apache.logging.slf4j.Log4jLoggerFactory
    val usingLog4j12 = "org.slf4j.impl.Log4jLoggerFactory".equals(binderClass)

    lazy val isInInterpreter: Boolean = {
      try {
        val interpClass = classForName("org.apache.spark.repl.Main")
        interpClass.getMethod("interp").invoke(null) != null
      } catch {
        case _: ClassNotFoundException => false
      }
    }

    def classForName(className: String): Class[_] = {
      Class.forName(className, true, getContextOrSparkClassLoader)
      // scalastyle:on classforname
    }

    def getContextOrSparkClassLoader: ClassLoader =
      Option(Thread.currentThread().getContextClassLoader).getOrElse(getSparkClassLoader)

    def getSparkClassLoader: ClassLoader = getClass.getClassLoader

    if (usingLog4j12) {
      val log4j12Initialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
      if (!log4j12Initialized) {
        // scalastyle:off println
        if (isInInterpreter) {
          val replDefaultLogProps = "org/apache/spark/log4j-defaults-repl.properties"
          Option(Utils.getSparkClassLoader.getResource(replDefaultLogProps)) match {
            case Some(url) =>
              PropertyConfigurator.configure(url)
              System.err.println(s"Using Spark's repl log4j profile: $replDefaultLogProps")
              System.err.println("To adjust logging level use sc.setLogLevel(\"INFO\")")
            case None =>
              System.err.println(s"Spark was unable to load $replDefaultLogProps")
          }
        } else {
          val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
          Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
            case Some(url) =>
              PropertyConfigurator.configure(url)
              System.err.println(s"Using Spark's default log4j profile: $defaultLogProps")
            case None =>
              System.err.println(s"Spark was unable to load $defaultLogProps")
          }
        }
        // scalastyle:on println
      }
    }
    Logging.initialized = true

    // Force a call into slf4j to initialize it. Avoids this happening from multiple threads
    // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
    log
  }
}

private object Logging {
  @volatile private var initialized = false
  val initLock = new Object()
  try {
    // We use reflection here to handle the case where users remove the
    // slf4j-to-jul bridge order to route their logs to JUL.
    val bridgeClass = Utils.classForName("org.slf4j.bridge.SLF4JBridgeHandler")
    bridgeClass.getMethod("removeHandlersForRootLogger").invoke(null)
    val installed = bridgeClass.getMethod("isInstalled").invoke(null).asInstanceOf[Boolean]
    if (!installed) {
      bridgeClass.getMethod("install").invoke(null)
    }
  } catch {
    case e: ClassNotFoundException => // can't log anything yet so just fail silently
  }
}
Solution 2:
Change the dependency to:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
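With either fix in place, the job can be run locally. It expects four arguments in the order <zkQuorum> <group> <topics> <numThreads>. For a quick test from the IDE, the arguments can also be supplied programmatically; the ZooKeeper address, consumer group and topic below are placeholder values, not taken from the original setup:

object RunKafKaStreamingWordCountLocally {
  def main(args: Array[String]): Unit = {
    // Placeholder arguments: <zkQuorum> <group> <topics> <numThreads>
    ezr.bigdata.spark.hive.streaming.KafKaStreamingWordCount.main(
      Array("localhost:2181", "wordcount-group", "wordcount-topic", "1")
    )
  }
}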