Spark Streaming

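  • Stream Data Processing and Analysis with Spark Streaming
  • I. What Is a Stream
  • II. Spark Streaming
  • 1. Introduction
  • 2. Stream processing architecture
  • 3. Internal workflow
  • III. StreamingContext
  • 1. Creating a StreamingContext
  • 2. Getting started: wordcount
  • 3. Wrapping with transform
  • IV. DStream
  • 1. Concepts
  • 2. Input DStreams and Receivers
  • 3. Creating DStreams (built-in streaming sources)
  • 4. Transformations supported by DStream
  • V. Spark Streaming programming examples
  • 1. HDFS
  • 2. Processing stateful data with Spark Streaming
  • 3. Integrating Spark Streaming with Spark SQL
  • VI. Advanced Spark Streaming applications
  • 1. Integrating Spark Streaming with Flume
  • (1) push
  • (2) poll
  • 2. Integrating Spark Streaming with Kafka
  • (1) wordcount
  • (2) window
  • VII. Spark Streaming optimization strategies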


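Stream Data Processing and Analysis with Spark Streaming

I. What Is a Stream

  • Data streams
      • Data flows in
      • Data is processed
      • Data flows out
  • Data streams are everywhere
      • E-commerce sites, log servers, social networks, and traffic monitoring systems generate large volumes of real-time data
  • Stream processing
      • A technique that lets users query a continuous data stream and detect conditions within a short time of the data arriving
  • Benefits of stream processing
      • It delivers insight faster, typically within milliseconds to seconds
      • Most data is produced as a never-ending stream of events
      • Batch processing requires storing the data, stopping collection at some point, and then processing what was stored
      • Stream processing is a natural fit for time-series data and for detecting patterns over time
  • Typical use cases
      • Stock market monitoring
      • Traffic monitoring
      • Computer system and network monitoring
      • Production line monitoring
      • Supply chain optimization
      • Intrusion, surveillance, and fraud detection
      • Most smart-device applications
      • Context-aware promotions and advertising
  • Stream processing frameworks
      • Apache Spark Streaming (second generation)
      • Apache Flink (third generation)
      • Confluent
      • Apache Storm (first generation)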

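II. Spark Streaming

1. Introduction

  • An extension of the Spark Core API for processing streaming data
  • Supports many data sources and many output sinks
  • Highly fault-tolerant
  • Scalable
  • High throughput
  • Low latency (micro-batches typically add latency on the order of 100 ms or more; the roughly 1 ms figure quoted for Spark 2.3 refers to the experimental continuous processing mode of Structured Streaming, not to the DStream API)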


2. Stream processing architecture

  • Typical architecture


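3. Internal workflow

  • Micro-batch processing: input -> split into batches -> result sets
  • Data arrives as a discretized stream (DStream: Discretized Stream)
  • The stream is divided into micro-batches (typically 1-10 s each), and each micro-batch is an RDD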
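
The micro-batch idea can be illustrated without Spark. The following is a minimal sketch, assuming timestamped records and a fixed batch interval; `MicroBatchSketch`, `Event`, and `discretize` are hypothetical names used only for illustration, and plain Scala collections stand in for RDDs. Unlike Spark Streaming, which still produces an (empty) RDD for an interval with no data, this sketch simply omits empty intervals.

```scala
object MicroBatchSketch {
  // Hypothetical record type: arrival time in milliseconds plus a payload.
  final case class Event(timeMs: Long, value: String)

  // Assigns every event to the batch whose interval contains its timestamp
  // and returns (batch start time, payloads) pairs in time order.
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timeMs / batchMs * batchMs) // start of the enclosing interval
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.timeMs).map(_.value)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0L, "a"), Event(4999L, "b"), Event(5000L, "c"), Event(12000L, "d"))
    // With a 5 s interval, a and b share the first batch; c and d each land in a later one.
    discretize(events, 5000L).foreach(println)
  }
}
```

Each `(start, payloads)` pair plays the role of one RDD in the DStream.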


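III. StreamingContext

1. Creating a StreamingContext

  • The entry point for Spark Streaming
  • As of version 2.2, SparkSession does not integrate StreamingContext, so it still has to be created separately

import org.apache.spark._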
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf=new SparkConf().setMaster("local[2]").setAppName("kgc streaming demo")
val ssc=new StreamingContext(conf,Seconds(8)) 
    
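//In spark-shell, this raises the following error:
//org.apache.spark.SparkException: Only one SparkContext may be running in this JVM
//Fixes:
//Option 1: sc.stop    //stop the SparkContext that spark-shell started before creating ssc
//Option 2: create ssc from the existing sc: val ssc=new StreamingContext(sc,Seconds(8))

1. Only one StreamingContext can be active in a JVM at a time
2. A StreamingContext cannot be restarted after it has been stopped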

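2. Getting started: wordcount

wordcount:

  • Word count. The program below reads text from the Kafka topic mydemo; the TCP-socket variant appears in the next subsection

Start the Kafka service: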

kafka-server-start.sh /opt/soft/kafka211/config/server.properties

pom:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.3.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.3.4</version>
    </dependency>

log4j: reduce console output (only ERROR and above is logged)

log4j.rootLogger=ERROR,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%-20c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object MyReadKafkaHandler {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("MyKafka")
    val sc = new SparkContext(conf)
    //Context class for stream processing
    val ssc = new StreamingContext(sc,Seconds(10))
    //Connection parameters for the Kafka broker
    val kafkaParam = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.253.150:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "mykafka1",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "true",
      ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG -> "20000",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest"
    )
    //Create the direct stream
    val streams = KafkaUtils.createDirectStream(ssc,LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String,String](Set("mydemo"),kafkaParam))
    //Simple processing, then print
    streams.map(_.value).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()
    //Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}

Command for sending test input:

kafka-console-producer.sh --broker-list 192.168.253.150:9092 --topic mydemo


3. Wrapping with transform

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("TransformDemo")
    val ssc: StreamingContext = new StreamingContext(conf,Seconds(5))
    val ds1 = ssc.socketTextStream("192.168.253.150",9999)
    //Use transform to apply RDD operations to each batch
    val transformDS = ds1.transform(
      rdd => {
       rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
      }
    )
    transformDS.print()
    ssc.start()
    //Wait for the computation to terminate
    ssc.awaitTermination()
  }

}

Without transform:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo01 {
  def main(args: Array[String]): Unit = {
    //Create a Spark StreamingContext
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Demo01")
    val ssc: StreamingContext = new StreamingContext(conf,Seconds(5))
    //Use Spark Streaming to run wordcount
    val inputDstream = ssc.socketTextStream("192.168.253.150",9999)
    //Transform the input stream
    // sample input: hadoop spark kafka
    val wordDstream: DStream[String] = inputDstream.flatMap(_.split(" "))
    val wordAndOneDstream: DStream[(String, Int)] = wordDstream.map((_,1))
    val wordcounts: DStream[(String, Int)] = wordAndOneDstream.reduceByKey(_+_)
    wordcounts.print()

    //Start receiving and processing data with start()
    ssc.start()
    //Wait for the computation to terminate
    ssc.awaitTermination()
  }
}
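
Both versions compute the same result for every 5-second batch. As a rough illustration of that per-batch result, here is a Spark-free sketch that applies the same split, pair, and reduce steps to an ordinary collection; `WordCountSketch` and `wordCount` are hypothetical names. One deliberate difference, labelled in the code, is that empty tokens are dropped.

```scala
object WordCountSketch {
  // Counts the words in the lines of a single batch.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty) // assumption: ignore empty tokens (the streaming examples keep them)
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // One batch containing two input lines.
    wordCount(Seq("hadoop spark kafka", "spark")).foreach(println)
  }
}
```

Counts start from zero again in each batch; carrying totals across batches needs a stateful operator such as updateStateByKey.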

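IV. DStream

1. Concepts

  • A discretized stream (DStream) is the high-level abstraction provided by Spark Streaming
  • A DStream represents a continuous series of RDDs
  • Each RDD contains the data from one time interval
  • A DStream can be either the input data stream or a stream produced by transforming another one
  • A transformation on a DStream is applied to each of its underlying RDDs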


2. Input DStreams and Receivers

  • An Input DStream is a DStream that receives data from a streaming source
  • Built-in streaming sources: file systems, Socket, Kafka, Flume, ...


Every Input DStream (except file streams) is associated with a Receiver, a dedicated object that pulls data from the source into memory

3. Creating DStreams (built-in streaming sources)

  • File system

def textFileStream(directory: String): DStream[String]

  • Socket

def socketTextStream(hostname: String, port: Int, storageLevel: StorageLevel): ReceiverInputDStream[String]

  • Flume Sink

val ds = FlumeUtils.createPollingStream(streamCtx, [sink hostname], [sink port]);

  • Kafka Consumer

val ds = KafkaUtils.createStream(streamCtx, zooKeeper, consumerGrp, topicMap);

4. Transformations supported by DStream

  • map,flatMap
  • filter
  • count, countByValue
  • repartition
  • union, join, cogroup
  • reduce, reduceByKey
  • transform
  • updateStateByKey

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("TransformDemo")
    val ssc: StreamingContext = new StreamingContext(conf,Seconds(5))
    val ds1 = ssc.socketTextStream("192.168.253.150",9999)
    //Use transform to apply RDD operations to each batch
    val transformDS = ds1.transform(
      rdd => {
//        rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
        rdd.map(
          x=>{
            //x is a String, so x*3 repeats the line three times
            x*3
          }
        )
      }
    )
    transformDS.print()
    
    val wordcounts = ds1.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

    ssc.start()
    //Wait for the computation to terminate
    ssc.awaitTermination()
  }

}

V. Spark Streaming programming examples

1. HDFS

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HDFSInputDStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSDemo")
    val ssc: StreamingContext = new StreamingContext(conf,Seconds(5))

    //Create an input stream that reads data from the file system
    val lines = ssc.textFileStream("hdfs://192.168.253.150:9000/data")
    lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
      .print()

    ssc.start()
    //Wait for the computation to terminate
    ssc.awaitTermination()

  }

}

Upload a test file to the monitored directory:

[root@zjw ~]# hdfs dfs -put /opt/data/exectest.txt /data
20/08/19 11:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


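2. Processing stateful data with Spark Streaming

  • Requirement: compute the cumulative count of each word seen so far
  • Analysis: DStream transformations are either stateless or stateful
      • Stateless: processing a batch does not depend on data from earlier batches
      • Stateful: processing the current batch uses data from earlier batches
      • updateStateByKey is a stateful transformation that tracks how state changes
  • Implementation notes
      • Define the state: it can be of any type
      • Define the state update function: its parameters are the new values from the current batch and the previous state

Code implementation: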

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}


object UpdateStateByKeyDemo {
  def main(args: Array[String]): Unit = {
    //Create a Spark StreamingContext
    val conf = new SparkConf().setMaster("local[*]").setAppName("UpdateStateByKey")
    val ssc = new StreamingContext(conf,Seconds(5))
    //Create an input stream
    val input = ssc.socketTextStream("192.168.253.150",5678)
    val res1: DStream[(String, Int)] = input.flatMap(_.split(" ")).map((_,1))
    //Set a checkpoint directory (updateStateByKey requires one)
    ssc.checkpoint("src/data/checkpoint")

    //Use updateStateByKey
    //First define the update function
    // e.g. for new pairs (hello,1)(hello,1), currentValue is Seq(1,1)
    def updateFunc(currentValue: Seq[Int],preValue:Option[Int])  = {
      val currsum = currentValue.sum
      val pre = preValue.getOrElse(0)
      Some(currsum+pre)
    }
    val state = res1.updateStateByKey(updateFunc)
    state.print()

    ssc.start()
    ssc.awaitTermination()
  }

}

Output:

-------------------------------------------
Time: 1597886075000 ms
-------------------------------------------
(zjw,1)

-------------------------------------------
Time: 1597886080000 ms
-------------------------------------------
(jds,1)
(zjw,3)
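
To see how the update function yields the totals above, the state handling can be replayed without Spark. This is a minimal sketch under stated assumptions: `StateSketch` and `step` are hypothetical names, an in-memory `Map` stands in for the checkpointed state, and the second batch is assumed to have contained `jds zjw zjw` (the output shows only the resulting counts, not the input).

```scala
object StateSketch {
  // Same logic as updateFunc in the example: sum of this batch's values plus the previous total.
  def updateFunc(currentValue: Seq[Int], preValue: Option[Int]): Option[Int] =
    Some(currentValue.sum + preValue.getOrElse(0))

  // Applies one batch of words to the running state.
  // The update function runs for every key that has old state or new values.
  def step(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(identity).map { case (word, ws) => word -> ws.map(_ => 1) }
    (state.keySet ++ newValues.keySet).toSeq.flatMap { key =>
      updateFunc(newValues.getOrElse(key, Seq.empty[Int]), state.get(key))
        .map(total => key -> total)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val afterBatch1 = step(Map.empty, Seq("zjw"))
    val afterBatch2 = step(afterBatch1, Seq("jds", "zjw", "zjw")) // assumed input
    println(afterBatch1)
    println(afterBatch2)
  }
}
```

Returning `None` from the update function instead of `Some(...)` would drop the key from the state, which is how stale keys can be expired.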

3. Integrating Spark Streaming with Spark SQL

Requirement: implement WordCount with Spark Streaming + Spark SQL
Analysis: convert each RDD into a DataFrame

Code implementation:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object SparkSQLSparkStreamingDemo {
  def main(args: Array[String]): Unit = {
    //Create a Spark StreamingContext
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Demo01")
    val ssc: StreamingContext = new StreamingContext(conf,Seconds(5))
    //Create the SparkSession (it reuses the SparkContext created above)
    val spark = SparkSession.builder().config(conf).getOrCreate()
    import spark.implicits._

    //Use Spark Streaming to run wordcount
    val inputDstream = ssc.socketTextStream("192.168.253.150",9999)
    //Transform the input stream
    // sample input: hadoop spark kafka
    val wordDStream: DStream[String] = inputDstream.flatMap(_.split(" "))
    wordDStream.foreachRDD(
      rdd=>{
        //Skip empty batches; isEmpty avoids counting the whole RDD
        if (!rdd.isEmpty()){
          val df1 = rdd.map(x=>Word(x)).toDF()
          df1.createOrReplaceTempView("words")
          spark.sql(
            """
              |select word,count(*)
              |from words
              |group by word
              |""".stripMargin).show()
        }
      }
    )
    ssc.start()
    //Wait for the computation to terminate
    ssc.awaitTermination()
  }
}
case class Word(word:String)

Output:

+-----+--------+
| word|count(1)|
+-----+--------+
|hello|       2|
|  zjw|       1|
+-----+--------+

VI. Advanced Spark Streaming applications

1. Integrating Spark Streaming with Flume

(1) push

maven:

<dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-flume_2.11</artifactId>
      <version>2.3.4</version>
    </dependency>

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkFlumePushDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("flumeDemo01")
    val ssc = new StreamingContext(conf,Seconds(5))

    //Push mode: Flume's avro sink sends events to this receiver
    val flumeStream = FlumeUtils.createStream(ssc,"zjw",55555)

    flumeStream.map(x=>new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
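
The `map` step above turns each Flume event body into a string with `new String(x.event.getBody.array())`, which relies on the JVM default charset and reads the buffer's whole backing array. A slightly more defensive variant is sketched below; it assumes the payload is UTF-8 text, and `EventBodySketch` and `decodeBody` are hypothetical helper names, not part of the Flume or Spark API.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object EventBodySketch {
  // Decodes only the readable region of the buffer (position up to limit) as UTF-8.
  // duplicate() leaves the caller's buffer position untouched.
  def decodeBody(body: ByteBuffer): String =
    StandardCharsets.UTF_8.decode(body.duplicate()).toString.trim

  def main(args: Array[String]): Unit = {
    val body = ByteBuffer.wrap("hello spark\n".getBytes(StandardCharsets.UTF_8))
    println(decodeBody(body))
  }
}
```

In the streaming job this helper could replace the lambda, for example `flumeStream.map(x => EventBodySketch.decodeBody(x.event.getBody))`.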

Flume configuration:

agent.sources=s1
agent.channels=c1
agent.sinks=sk1

agent.sources.s1.type=netcat
agent.sources.s1.bind=zjw
agent.sources.s1.port=44444
agent.sources.s1.channels=c1


agent.channels.c1.type=memory
agent.channels.c1.capacity=1000

agent.sinks.sk1.type=avro
agent.sinks.sk1.hostname=zjw
agent.sinks.sk1.port=55555
agent.sinks.sk1.channel=c1

Start the Spark application first:

spark-submit  \
--class day0819.test07.SparkFlumePushDemo  \
--packages org.apache.spark:spark-streaming-flume_2.11:2.3.4 \
/opt/jars/SparkStreaming_Beijing-1.0-SNAPSHOT.jar

Then start Flume:

[root@zjw ~]# flume-ng agent -f /opt/soft/flume160/flumeconf/spark_Streaming_flume.properties -n agent

Finally, start netcat:

[root@zjw ~]# nc 192.168.253.150 44444


(2) poll

Code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

object SparkFlumePollDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("flumedemo01").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

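    //Poll mode: Spark pulls events from the SparkSink inside the Flume agent
    //The host and port must match agent.sinks.sk1.hostname/port in the Flume configuration below
    val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, "zjw", 55555)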


    flumeStream.map(x=>new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Flume configuration:

agent.sources = s1    
agent.channels = c1  
agent.sinks = sk1  
  
# Set the source type to netcat; it uses channel c1
agent.sources.s1.type = netcat  
agent.sources.s1.bind = zjw  
agent.sources.s1.port = 44444  
agent.sources.s1.channels = c1  
  

# SparkSink: requires spark-streaming-flume-sink_2.11-x.x.x.jar in Flume's lib directory
agent.sinks.sk1.type=org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.sk1.hostname=zjw
agent.sinks.sk1.port=55555
agent.sinks.sk1.channel = c1  
# Channel settings
# In-memory channel
agent.channels.c1.type = memory 
agent.channels.c1.capacity = 1000

The startup order is the same as above.

2. Integrating Spark Streaming with Kafka

(1) wordcount

Use the Direct approach in Spark Streaming to consume data from a Kafka topic, then compute word frequencies with DStream operations

pom:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.3.4</version>
    </dependency>

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkKafkaDirectDemo {
  def main(args: Array[String]): Unit = {

    val conf: SparkConf = new SparkConf().setAppName("kafkaDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    //Consume Kafka data with Spark Streaming
    //  val kafkaParams = Map(
    //    ("bootstrap.servers", "192.168.253.150:9092"),
    //    ("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"),
    //    ("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"),
    //    ("group.id", "testGroup1")
    //  )
    val kafkaParams = Map(
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.253.150:9092"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.GROUP_ID_CONFIG, "kafkaGroup01")
    )


    val message: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("testPartition2"), kafkaParams)
    )


    message.map(x=>x.value()).flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).print()

    ssc.start()
    ssc.awaitTermination()

  }
}

(2) window

Code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object SparkStreamingWindowDemo {
  def main(args: Array[String]): Unit = {
    //Create a Spark StreamingContext

    //Use more than one thread: the receiver needs one thread and processing needs another, so local[n] with n>1
    val conf: SparkConf = new SparkConf().setAppName("demo01").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    //Use Spark Streaming to run wordcount
    val inputDstream: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.253.150", 9998)

    //On the VM, open the port with: nc -lk 9998

    //Transform the input stream
    // sample input: hadoop spark kafka

    val wordDsteam: DStream[String] = inputDstream.flatMap(_.split(" "))
    val wordAndOneDsteam: DStream[(String, Int)] = wordDsteam.map((_, 1))

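    val windowRS: DStream[(String, Int)] = wordAndOneDsteam.reduceByKeyAndWindow(
      (a: Int, b: Int) => (a + b),
      Seconds(15), //window length: must be a multiple of the 5 s batch interval
      Seconds(10)  //slide interval: must also be a multiple of the batch interval
    )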
    windowRS.print()

    //Start receiving and processing data with start()
    ssc.start()
    //Wait for the computation to terminate
    ssc.awaitTermination()
  }

}

Open the port on the VM:

[root@zjw jars]# nc -lk 9998
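
With a 5 s batch interval, a 15 s window covers the last three batches and a 10 s slide emits a result every second batch. The sketch below replays that arithmetic on plain collections; `WindowSketch` and `windowedCounts` are hypothetical names, and the window and slide are expressed as numbers of batches rather than durations.

```scala
object WindowSketch {
  // batches(i) holds the words received during batch i.
  // Emits one word-count map per slide, each covering the last `windowBatches` batches.
  def windowedCounts(
      batches: Seq[Seq[String]],
      windowBatches: Int,
      slideBatches: Int): Seq[Map[String, Int]] =
    (slideBatches to batches.length by slideBatches).map { end =>
      val start = math.max(0, end - windowBatches) // early windows are only partly filled
      batches.slice(start, end).flatten
        .groupBy(identity)
        .map { case (word, ws) => word -> ws.size }
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a"), Seq("a", "b"), Seq("b"), Seq("c"))
    // window = 3 batches (15 s), slide = 2 batches (10 s)
    windowedCounts(batches, 3, 2).foreach(println)
  }
}
```

Because consecutive windows overlap by one batch here, a word typed once can be counted in two successive outputs.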

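VII. Spark Streaming optimization strategies

  • Reduce batch processing time
      • Parallelism of data receiving
      • Parallelism of data processing
      • Task launch overhead
  • Choose an appropriate batch interval
  • Memory tuning
      • DStream persistence level
      • Clearing old data
      • CMS garbage collector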
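
Several of these strategies map onto Spark configuration properties. The following sketch collects a few of them in a plain `Map`; the property names are standard Spark settings, but every value shown is an illustrative assumption to be tuned per workload, and `TuningSketch` is a hypothetical name.

```scala
object TuningSketch {
  // Illustrative values only: tune each one against the actual workload.
  val settings: Map[String, String] = Map(
    // Data receiving parallelism: a smaller block interval yields more blocks (and tasks) per batch.
    "spark.streaming.blockInterval" -> "100ms",
    // Data processing parallelism: default partition count for shuffles such as reduceByKey.
    "spark.default.parallelism" -> "8",
    // Task launch overhead: Kryo serialization shrinks serialized tasks and data.
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    // Clearing old data: let Spark Streaming unpersist RDDs it no longer needs.
    "spark.streaming.unpersist" -> "true",
    // CMS garbage collector on executors (assumes a JVM version that still ships CMS).
    "spark.executor.extraJavaOptions" -> "-XX:+UseConcMarkSweepGC"
  )

  def main(args: Array[String]): Unit =
    settings.foreach { case (key, value) => println(s"$key=$value") }
}
```

In an application, the map could be applied with `new SparkConf().setAll(TuningSketch.settings)` before the StreamingContext is created.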