Apache Spark Day 4
Spark Stream Processing
Overview
Stream processing is usually discussed in contrast to batch processing. In the streaming model, the input is continuous and can be regarded as unbounded in time, which means the complete data set is never available at the time of computation; accordingly, results are emitted continuously and are also unbounded in time. Stream processing generally has strict latency requirements: the computation is defined first, and the logic is then applied to data as it arrives. To keep the computation efficient, incremental computation is used wherever possible instead of recomputing over the full data set (a minimal sketch after the comparison table below illustrates the difference). In the batch model, the full data set exists first, the computation logic is defined afterwards and applied to it; the computation covers all of the data and the complete result is emitted once.
Batch processing vs. stream processing
| Category | Data type | Data volume | Latency | Computation lifecycle | Scenario |
| --- | --- | --- | --- | --- | --- |
| Batch processing | Static | GB and above | 30 minutes to several hours | Terminates eventually | Offline |
| Stream processing | Continuous, dynamic | A single record, a few hundred bytes | Milliseconds to sub-second | Runs 24/7 | Online |
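To make the incremental-versus-full distinction above concrete, here is a minimal, framework-free Scala sketch (purely illustrative; the object name and data are made up): the streaming-style loop only folds each new micro-batch into a running state, while the batch-style computation waits for all of the data and produces the result in one pass.
object IncrementalVsFullSketch {
  def main(args: Array[String]): Unit = {
    // Pretend these are two micro-batches arriving one after the other.
    val batches = Seq(Seq("hello spark", "hello"), Seq("spark streaming"))

    // Streaming style: incrementally fold each new batch into the running counts.
    var runningCounts = Map.empty[String, Int]
    for (batch <- batches) {
      for (word <- batch.flatMap(_.split("\\s+"))) {
        runningCounts = runningCounts.updated(word, runningCounts.getOrElse(word, 0) + 1)
      }
      println(s"incremental result so far: $runningCounts")
    }

    // Batch style: wait until all data is available, then compute once over the full set.
    val fullCounts = batches.flatten
      .flatMap(_.split("\\s+"))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
    println(s"one-shot full result: $fullCounts")
  }
}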
The mainstream stream-processing frameworks today are Kafka Streams, Storm (JStorm), Spark Streaming, and Flink (Blink).
- Kafka Streams: a stream-processing library shipped as a jar on top of Kafka; it has a low barrier to entry and is easy to integrate.
- Apache Storm: a pure stream-processing engine that can process on the order of millions of records per second with low latency.
- Spark Streaming: a stream-processing framework built on top of Spark's batch engine. Unlike batch processing, the data being processed is an unbounded stream and the output is continuous. Internally, Spark Streaming splits the stream into a series of micro-batch RDDs, so at the micro level Spark Streaming is still a batch framework.
- Flink DataStream: brings large improvements in latency, usability, and performance, and is currently the most popular stream-processing engine.
Quick Start
- Import the dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
- Write the Driver
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWordCountTopology {
  def main(args: Array[String]): Unit = {
    //1. Create the environment needed for the streaming computation
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("spark://CentOS:7077")
    val streamingContext = new StreamingContext(conf, Seconds(1))
    //2. Create the continuous input DStream (detailed later)
    val lines: DStream[String] = streamingContext.socketTextStream("CentOS", 9999)
    //3. Apply transformations to the discretized stream (detailed later)
    val result: DStream[(String, Int)] = lines.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((v1, v2) => v1 + v2)
    //4. Print the results to the console
    result.print()
    //5. Start the streaming computation
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
- Build the jar with mvn package
<build>
    <plugins>
        <!-- Scala compiler plugin -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.0.1</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- Plugin for building a fat jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Java compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
            <executions>
                <execution>
                    <phase>compile</phase>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
- Install the nc tool
[root@CentOS ~]# yum -y install nmap-ncat
- Start an nc server
[root@CentOS ~]# nc -lk 9999
- Submit the application
[root@CentOS spark-2.4.5]# ./bin/spark-submit --master spark://CentOS:7077 --name SparkWordCountTopology --deploy-mode client --class com.baizhi.quickstart.SparkWordCountTopology --total-executor-cores 6 /root/spark-dstream-1.0-SNAPSHOT.jar
- The running job can be monitored from the Spark web UI.
Discretized Streams
A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either an input stream received from a source or a processed stream produced by transforming an input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable distributed dataset, and each RDD in a DStream contains the data of one specific time interval.
Any operation applied to a DStream translates into operations on the underlying RDDs. In the Quick Start example above, for instance, the flatMap operation is applied to every RDD of the lines DStream to generate the RDDs of the words DStream.
Note: because of how DStreams work underneath, the Seconds() interval configured on the StreamingContext should be slightly larger than the time needed to process one micro-batch; otherwise data will pile up in Spark's memory.
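As a rough illustration of that sizing rule, the sketch below (not part of the original example; the 5-second interval is an arbitrary placeholder) picks a batch interval with headroom over the expected per-batch processing time and enables Spark's built-in backpressure so that ingestion is throttled when batches start to lag. Compare "Processing Time" against the batch interval on the Streaming tab of the web UI to validate the choice.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: choose a batch interval larger than the typical micro-batch processing time,
// and let Spark throttle the receivers when processing falls behind.
val conf = new SparkConf()
  .setAppName("IntervalTuningSketch")
  .setMaster("local[*]")
  .set("spark.streaming.backpressure.enabled", "true") // throttle ingestion under load

// If one batch usually takes ~3 seconds to process, Seconds(5) leaves headroom; Seconds(1) would not.
val ssc = new StreamingContext(conf, Seconds(5))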
DStreams & Receivers
Every input DStream (except file streams, discussed later) is associated with a Receiver object, which receives the data from its source and stores it in Spark's memory for processing. Spark Streaming provides two categories of built-in input sources for ingesting data from external systems:
Built-in input sources
- Basic sources: sources available directly through the StreamingContext API, e.g. fileStream (reading files) and socket streams (for testing).
socketTextStream
val lines: DStream[String] = ssc.socketTextStream("CentOS", 9999)
File Streams
val lines: DStream[String] = ssc.textFileStream("hdfs://CentOS:9000/words/src")
or
val lines: DStream[(LongWritable,Text)] = ssc.fileStream[LongWritable,Text,TextInputFormat]("hdfs://CentOS:9000/words/src")
Spark checks, based on file timestamps, whether new files have appeared under the monitored directory (hdfs://CentOS:9000/words/src in the example above); whenever a new file is created it is read automatically. Changes to the contents of existing files are not monitored. Tip: when testing, make sure the clocks of the client and the cluster are synchronized. One common way to drop files into the monitored directory atomically is sketched after the Queue of RDDs example below.
- Queue of RDDs (for testing)
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val queueRDDS = new Queue[RDD[String]]()
// Generate test data in a background thread
new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      // Push test RDDs into the queue
      queueRDDS += ssc.sparkContext.makeRDD(List("this is a demo", "hello hello"))
      Thread.sleep(500)
    }
  }
}).start()
// Create the continuous input DStream from the queue
val lines: DStream[String] = ssc.queueStream(queueRDDS)
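Returning to file streams: since textFileStream only picks up newly created files, a file should appear in the monitored directory all at once. A common approach is to write it somewhere else first and then rename it into place. The following is a minimal sketch under the assumption that the Hadoop client libraries are on the classpath; the staging path is purely hypothetical.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write or upload the file into a staging directory first, then rename it into the
// monitored directory so the file stream only ever sees a complete file.
val fs = FileSystem.get(new URI("hdfs://CentOS:9000"), new Configuration())
val staged    = new Path("/words/staging/part-0001.txt") // hypothetical staging location
val monitored = new Path("/words/src/part-0001.txt")     // directory watched by the file stream
fs.rename(staged, monitored)
fs.close()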
Advanced sources
- Advanced sources: sources that are not bundled with Spark itself, such as Kafka, Flume, and Kinesis; these generally require additional third-party dependencies.
Custom Receiver
import scala.util.Random
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(values: List[String]) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = {
    new Thread(new Runnable {
      override def run(): Unit = receive()
    }).start()
  }
  override def onStop(): Unit = {}
  // Receive data from the external system
  private def receive(): Unit = {
    try {
      while (!isStopped()) {
        Thread.sleep(500)
        val line = values(new Random().nextInt(values.length))
        // Push a randomly chosen line into Spark
        store(line)
      }
      restart("Trying to restart again")
    } catch {
      case t: Throwable =>
        restart("Error trying to restart again", t)
    }
  }
}
val arrays = List("this is a demo", "good good ", "study come on")
val lines: DStream[String] = ssc.receiverStream(new CustomReceiver(arrays))
Integrating Spark with Kafka
Reference: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

//1. Create the environment needed for the streaming computation
val conf = new SparkConf()
  .setAppName("SparkWordCountTopology")
  //.setMaster("spark://CentOS:7077")
  .setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.sparkContext.setLogLevel("FATAL")

val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "CentOS:9092",
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  "group.id" -> "g1",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topic01")
val kafkaInputs: DStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](ssc,
  LocationStrategies.PreferConsistent,                               // partition-to-executor location strategy
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)  // subscription strategy
)

kafkaInputs.map(record => record.value())
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey((v1, v2) => v1 + v2)
  .print()

//5. Start the streaming computation
ssc.start()
ssc.awaitTermination()
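Because enable.auto.commit is set to false, the example above never commits consumer offsets back to Kafka. The 0-10 integration guide linked above exposes each micro-batch's offsets through HasOffsetRanges and lets the application commit them through CanCommitOffsets. A minimal sketch of that pattern, reusing the kafkaInputs stream from the example, is shown below; it would be registered before ssc.start().
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit the Kafka offsets of each micro-batch once that batch has been handled.
kafkaInputs.foreachRDD { rdd =>
  // RDDs produced by createDirectStream carry their Kafka offset ranges.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the records of this batch here ...

  // Asynchronously commit the consumed offsets back to Kafka.
  kafkaInputs.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}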
Transformations
DStream transformations are similar to RDD transformations: they turn one DStream into a new DStream, and most of the common operators behave the same way as their Spark RDD counterparts.
| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
map
//1,zhangsan,true
lines.map(line=> line.split(","))
.map(words=>(words(0).toInt,words(1),words(2).toBoolean))
.print()
flatMap
//hello spark
lines.flatMap(line=> line.split("\\s+"))
.map((_,1)) //(hello,1)(spark,1)
.print()
filter
//keep only lines that contain "hello"
lines.filter(line => line.contains("hello"))
.flatMap(line=> line.split("\\s+"))
.map((_,1))
.print()
repartition (change the number of partitions)
lines.repartition(10) //change the parallelism (number of partitions)
.filter(line => line.contains("hello"))
.flatMap(line=> line.split("\\s+"))
.map((_,1))
.print()
union (merge two streams)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
.filter(line => line.contains("hello"))
.flatMap(line=> line.split("\\s+"))
.map((_,1))
.print()
count
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
.flatMap(line=> line.split("\\s+"))
.count() //count the number of elements in each micro-batch RDD
.print()
reduce(func)
//sample input: aa bb
val stream: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream.flatMap(line=> line.split("\\s+"))
.reduce(_+"|"+_)
.print() // aa|bb
countByValue (count occurrences of each value)
val stream: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream.repartition(10) // a a b c
.flatMap(line=> line.split("\\s+"))
.countByValue() //(a,2) (b,1) (c,1)
.print()
reduceByKey(func, [numTasks])
var lines:DStream[String]=ssc.socketTextStream("CentOS",9999) //this is spark this
lines.repartition(10)
.flatMap(line=> line.split("\\s+").map((_,1)))
.reduceByKey(_+_)// (this,2)(is,1)(spark ,1)
.print()
join(otherStream, [numTasks])
//1 zhangsan
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
//1 apple 1 4.5
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
val userPair:DStream[(String,String)]=stream1.map(line=>{
var tokens= line.split(" ")
(tokens(0),tokens(1))
})
val orderItemPair:DStream[(String,(String,Double))]=stream2.map(line=>{
val tokens = line.split(" ")
(tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
userPair.join(orderItemPair) //(key,(user, order item))
.map(t=>(t._1,t._2._1,t._2._2._1,t._2._2._2))//1 zhangsan apple 4.5
.print()
Records from the two streams can only be joined if they land in the same micro-batch RDD; otherwise the join cannot be completed, which makes this operator of limited practical use.
transform
transform lets a stream be combined with an RDD, because it exposes the underlying micro-batch RDD of each interval; this is how a stream-batch join can be implemented.
//1 apple 2 4.5
val orderLog: DStream[String] = ssc.socketTextStream("CentOS",8888)
var userRDD=ssc.sparkContext.makeRDD(List(("1","zhangsan"),("2","wangwu"))) //static (batch) lookup data
val orderItemPair:DStream[(String,(String,Double))]=orderLog.map(line=>{
val tokens = line.split(" ")
(tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
orderItemPair.transform(rdd=> rdd.join(userRDD))
.print() //(1,((apple,9.0),zhangsan))
updateStateByKey (stateful computation, outputs the full state)
ssc.checkpoint("hdfs://CentOS:9000/spark-checkpoint") //checkpoint directory for state snapshots
val lines: DStream[String] = ssc.socketTextStream("CentOS",9999)

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = newValues.sum + runningCount.getOrElse(0)
  Some(newCount)
}

lines.flatMap(_.split("\\s+"))
  .map((_,1))
  .updateStateByKey(updateFunction)
  .print()
A checkpoint directory must be configured to store the program's state. Because the full state is re-emitted on every batch, memory consumption is heavy.
mapWithState (stateful computation, incremental output)
ssc.checkpoint("hdfs://CentOS:9000/spark-checkpoint") //checkpoint directory for state snapshots
val lines: DStream[String] = ssc.socketTextStream("CentOS",9999)

lines.flatMap(_.split("\\s+"))
  .map((_,1))
  .mapWithState(StateSpec.function((k: String, v: Option[Int], state: State[Int]) => {
    var historyCount = 0
    if (state.exists()) {
      historyCount = state.get()
    }
    historyCount += v.getOrElse(0)
    //update the state
    state.update(historyCount)
    (k, historyCount)
  }))
  .print()
A checkpoint directory must be configured to store the program's state.
DStream failure recovery
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object SparkWordCountFailRecorver {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs://CentOS:9000/spark-checkpoint1"
    //First try to recover from the checkpoint; if recovery is not possible, build a new context with recoveryFunction
    val ssc = StreamingContext.getOrCreate(checkpointDir, recoveryFunction)
    ssc.sparkContext.setLogLevel("FATAL")
    ssc.checkpoint("hdfs://CentOS:9000/spark-checkpoint1") //checkpoint directory for state snapshots
    //Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }

  val recoveryFunction = () => {
    println("======recoveryFunction========")
    Thread.sleep(3000)
    val conf = new SparkConf()
      .setAppName("SparkWordCountTopology")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines: DStream[String] = ssc.socketTextStream("CentOS", 9999)

    lines.flatMap(_.split("\\s+"))
      .map((_, 1))
      .mapWithState(StateSpec.function((k: String, v: Option[Int], state: State[Int]) => {
        var historyCount = 0
        if (state.exists()) {
          historyCount = state.get()
        }
        historyCount += v.getOrElse(0)
        //update the state
        state.update(historyCount)
        (k, historyCount)
      }))
      .print()
    ssc
  }
}
Drawback: once the state has been checkpointed, any later code changes are invisible to the job, because the system will not call recoveryFunction again; to make modified code take effect, the checkpoint directory has to be deleted manually.