Flink Program Deployment
Local Deployment
package com.baizhi.jsy.deploy
import org.apache.flink.streaming.api.scala._
object FlinkWordCountCreateLocal {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment (local, with parallelism 3)
    val env = StreamExecutionEnvironment.createLocalEnvironment(3)
    //2. Create the DataStream
    val text = env.socketTextStream("Centos",9999)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
Remote Deployment
[root@CentOS ~]# cd /usr/flink-1.10.0/
[root@CentOS flink-1.10.0]# ./bin/flink run
--class com.baizhi.quickstart.FlinkWordCountQiuckStart
--detached # submit in detached (background) mode
--parallelism 4 # default parallelism of the job
--jobmanager CentOS:8081 # target JobManager for submission
/root/flink-datastream-1.0-SNAPSHOT.jar
Job has been submitted with JobID f2019219e33261de88a1678fdc78c696
StreamExecutionEnvironment.getExecutionEnvironment detects the runtime environment automatically. If the program runs inside the IDE (IDEA), it switches to local mode, and the default parallelism is the maximum number of threads of the machine, equivalent to Spark's local[*]. In production, the user should specify the parallelism at submission time with --parallelism.
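As a minimal sketch of that behavior (the object name is illustrative, the pipeline mirrors the examples in these notes): getExecutionEnvironment resolves to a local environment inside the IDE and to the cluster environment when submitted with bin/flink run, and a parallelism set in code takes precedence over the --parallelism flag.

package com.baizhi.jsy.deploy
import org.apache.flink.streaming.api.scala._
object FlinkWordCountAutoEnv {   // illustrative object name, not part of the original project
  def main(args: Array[String]): Unit = {
    // Resolves to a local environment in the IDE, to the cluster environment under bin/flink run
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Optional: parallelism set here overrides --parallelism supplied at submission time
    // env.setParallelism(4)
    env.socketTextStream("Centos", 9999)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
      .print()
    env.execute("Window Stream WordCount")
  }
}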
List the currently running jobs
[root@Centos flink-1.10.0]# ./bin/flink list --running --jobmanager Centos:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
05.03.2020 12:38:51 : 9563c3b5c7bc9b0f36326eaa51e27d95 : Window Stream WordCount (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
List all jobs (including finished ones)
[root@Centos flink-1.10.0]# ./bin/flink list --all --jobmanager Centos:8081
Waiting for response...
No running jobs.
No scheduled jobs.
---------------------- Terminated Jobs -----------------------
05.03.2020 12:38:51 : 9563c3b5c7bc9b0f36326eaa51e27d95 : Window Stream WordCount (CANCELED)
--------------------------------------------------------------
Cancel a specific job
[root@Centos flink-1.10.0]# ./bin/flink cancel --jobmanager Centos:8081 9563c3b5c7bc9b0f36326eaa51e27d95
Cancelling job 9563c3b5c7bc9b0f36326eaa51e27d95.
Cancelled job 9563c3b5c7bc9b0f36326eaa51e27d95.
View the program execution plan
[root@CentOS flink-1.10.0]# ./bin/flink info --class com.baizhi.quickstart.FlinkWordCountQiuckStart --parallelism 4 /root/flink-datastream-1.0-SNAPSHOT.jar
----------------------- Execution Plan -----------------------
{"nodes":[
  {"id":1,"type":"Source: Socket Stream","pact":"Data Source","contents":"Source: Socket Stream","parallelism":1},
  {"id":2,"type":"Flat Map","pact":"Operator","contents":"Flat Map","parallelism":4,"predecessors":[{"id":1,"ship_strategy":"REBALANCE","side":"second"}]},
  {"id":3,"type":"Map","pact":"Operator","contents":"Map","parallelism":4,"predecessors":[{"id":2,"ship_strategy":"FORWARD","side":"second"}]},
  {"id":5,"type":"aggregation","pact":"Operator","contents":"aggregation","parallelism":4,"predecessors":[{"id":3,"ship_strategy":"HASH","side":"second"}]},
  {"id":6,"type":"Sink: Print to Std. Out","pact":"Data Sink","contents":"Sink: Print to Std. Out","parallelism":4,"predecessors":[{"id":5,"ship_strategy":"FORWARD","side":"second"}]}
]}
--------------------------------------------------------------
No description provided.
You can open https://flink.apache.org/visualizer/ and paste the JSON there to view the Flink execution-plan graph.
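The same plan JSON can also be printed from inside the program by calling env.getExecutionPlan before execute (the user-defined source examples further below do exactly this); a minimal sketch with an illustrative object name:

import org.apache.flink.streaming.api.scala._
object PrintExecutionPlan {   // illustrative object name, not part of the original project
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("Centos", 9999).print()   // any pipeline works here
    // Prints a plan JSON in the same format, ready to paste into the visualizer
    println(env.getExecutionPlan)
  }
}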
Cross-Platform Submission
Before running, the program must be repackaged with mvn; then simply run the main function.
package com.baizhi.jsy.deploy
import org.apache.flink.streaming.api.scala._
object FlinkWordCountCorssplatRemove {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    //path of the packaged jar (Copy Path in the IDE)
    var jars="D:\\idea_code\\flink\\flink_wordCount\\target\\flink_wordCount-1.0-SNAPSHOT.jar"
    val env = StreamExecutionEnvironment.createRemoteEnvironment("Centos",8081,jars)
    //set the default parallelism
    env.setParallelism(4)
    //2. Create the DataStream
    val text = env.socketTextStream("Centos",9999)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
Streaming (DataStream API)
DataSource
A data source is where the program reads its input from. A SourceFunction is added to the program via env.addSource(SourceFunction). Flink ships with many ready-made SourceFunction implementations, but users can also implement the SourceFunction interface (non-parallel) or the ParallelSourceFunction interface (parallel); if state management is needed, extend RichParallelSourceFunction.
File-based
Reads the file under /demo/words only once.
readTextFile(path) - Reads (once) text files, i.e. files that respect the TextInputFormat specification, line-by-line and returns them as Strings.
package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFileSources {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream (the file is read only once)
    val text:DataStream[String] = env.readTextFile("hdfs://Centos:9000/demo/words")
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
readFile(fileInputFormat, path) - Reads (once) files as dictated by the specified file input format.
package com.baizhi.jsy.sources
import org.apache.flink.api.common.io.FileInputFormat
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFileSources {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream (the file is read only once)
    var inputFormat:FileInputFormat[String]=new TextInputFormat(null) //the path is passed to readFile below, so null is fine here
    val text = env.readFile(inputFormat,"hdfs://Centos:9000/demo/words")
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
Both of the approaches above read the files under /demo/words only once.
readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) -
This is the method the two calls above delegate to internally. It reads files according to the given fileInputFormat. Depending on the provided watchType, the source either periodically monitors the path for new data every interval milliseconds (FileProcessingMode.PROCESS_CONTINUOUSLY) or processes the data currently in the path once and exits (FileProcessingMode.PROCESS_ONCE). With pathFilter, the user can further exclude files from being processed.
This method watches the files in the monitored directory; if a file changes, the system reads it again, which may lead to duplicate computation of that file. In general, do not modify file contents; upload new files instead.
package com.baizhi.jsy.sources
import org.apache.flink.api.common.io.FileInputFormat
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFileSources {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream (monitor the path continuously, checking every 1000 ms)
    var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
    val text = env.readFile(inputFormat,"hdfs://Centos:9000/demo/words"
      ,FileProcessingMode.PROCESS_CONTINUOUSLY,1000)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
Socket Based
socketTextStream - Reads from a socket. Elements can be separated by a delimiter.
package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.scala._
object FlinkWordCountSocketTextStream {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream (delimiter '\n', at most 3 connection retries)
    val text = env.socketTextStream("Centos",9999,'\n',3)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
Collection-based
package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.scala._
object FlinkWordCountCollection {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream from an in-memory collection
    val text = env.fromCollection(List("this is a demo","hello world jiang si yu"))
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
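Closely related (a sketch not taken from the original notes): fromElements builds a bounded stream directly from the listed values, which is convenient for quick tests.

package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFromElements {   // illustrative object name
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // fromElements creates a bounded DataStream from the given values
    val text = env.fromElements("this is a demo", "hello world")
    text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
      .print()
    env.execute("FromElements WordCount")
  }
}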
UserDefinedSource
SourceFunction
package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFileSourcesUserDefineNonParallel {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream from the user-defined (non-parallel) source
    val text = env.addSource[String](new UserDefineNonParallelSourceFunction)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    println(env.getExecutionPlan)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random
class UserDefineNonParallelSourceFunction extends SourceFunction[String] {
  @volatile //prevent threads from working on a cached copy of the flag
  var isRunning:Boolean=true
  val lines:Array[String] = Array("this is a demo","hello world","ni hao me")
  //emit data downstream from this method via sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning){
      Thread.sleep(100)
      //send data to the downstream operators
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }
  //release resources / stop the source
  override def cancel(): Unit = {
    isRunning=false
  }
}
ParallelSourceFunction
package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFileSourcesUserDefineParallel {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream from the user-defined (parallel) source
    val text = env.addSource[String](new UserDefineParallelSourceFunction)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    println(env.getExecutionPlan)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sources
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import scala.util.Random
class UserDefineParallelSourceFunction extends ParallelSourceFunction[String] {
  @volatile //prevent threads from working on a cached copy of the flag
  var isRunning:Boolean=true
  val lines:Array[String] = Array("this is a demo","hello world","ni hao")
  //emit data downstream from this method via sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning){
      Thread.sleep(100)
      //send data to the downstream operators
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }
  override def cancel(): Unit = {
    isRunning=false
  }
}
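The DataSource overview above also mentions RichParallelSourceFunction for sources that need state management or lifecycle hooks. A minimal sketch under that assumption (class name and emitted data are illustrative, not from the original project); it is added with env.addSource in exactly the same way as the two sources above:

package com.baizhi.jsy.sources
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random
class UserDefineRichParallelSourceFunction extends RichParallelSourceFunction[String] {   // illustrative class
  @volatile var isRunning: Boolean = true
  var lines: Array[String] = _
  // open/close are the lifecycle hooks that plain (Parallel)SourceFunction does not offer
  override def open(parameters: Configuration): Unit = {
    lines = Array("this is a demo", "hello world", "ni hao")
  }
  override def close(): Unit = {}
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning) {
      Thread.sleep(100)
      sourceContext.collect(lines(new Random().nextInt(lines.length)))
    }
  }
  override def cancel(): Unit = {
    isRunning = false
  }
}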
Kafka Integration
Add the Maven dependency
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
SimpleStringSchema
The SimpleStringSchema approach only deserializes the value of the Kafka record.
Start HDFS, ZooKeeper, and Kafka first.
package com.baizhi.jsy.kafka
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
object FlinkWordCountKafkaSource {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream from the Kafka topic
    val props = new Properties()
    props.setProperty("bootstrap.servers", "Centos:9092")
    props.setProperty("group.id", "g1")
    val text = env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
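FlinkKafkaConsumer also exposes setters for the position the consumer starts reading from; a minimal sketch (the object name is illustrative, the connector methods are standard):

package com.baizhi.jsy.kafka
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
object FlinkKafkaSourceStartPosition {   // illustrative object name
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val props = new Properties()
    props.setProperty("bootstrap.servers", "Centos:9092")
    props.setProperty("group.id", "g1")
    val consumer = new FlinkKafkaConsumer[String]("topic01", new SimpleStringSchema(), props)
    consumer.setStartFromGroupOffsets()   // default: resume from the committed group offsets
    // consumer.setStartFromEarliest()    // alternatively, read the topic from the beginning
    // consumer.setStartFromLatest()      // or read only records produced after the job starts
    env.addSource(consumer).print()
    env.execute("Kafka Source Start Position")
  }
}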
Custom Schema
KafkaDeserializationSchema
package com.baizhi.jsy.kafka
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
object FlinkWordCountKafkaSourceDeserializationSchema {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream from the Kafka topic, using the custom deserialization schema
    val props = new Properties()
    props.setProperty("bootstrap.servers", "Centos:9092")
    props.setProperty("group.id", "g1")
    val text = env.addSource(new FlinkKafkaConsumer[(String,String,Int,Long)]("topic01",new UserDefinedKafkaDeserializationSchema(),props))
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(t=>{
        println(t)
        t._2.split("\\s+")
      })
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the results to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.kafka
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.flink.api.scala._
class UserDefinedKafkaDeserializationSchema extends KafkaDeserializationSchema[(String,String,Int,Long)]{
  //whether the stream has reached its end (never, for an unbounded Kafka stream)
  override def isEndOfStream(t: (String, String, Int, Long)): Boolean = false
  override def deserialize(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String, Int, Long) = {
    if(consumerRecord.key()!=null){
      (new String(consumerRecord.key()),new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
    }else{
      ("",new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
    }
  }
  override def getProducedType: TypeInformation[(String, String, Int, Long)] = {
    createTypeInformation[(String, String, Int, Long)]
  }
}
JSONKeyValueDeserializationSchema
Requires both the key and the value of the Kafka topic to be in JSON format. When using it, you can also specify whether to read the metadata (topic, partition, offset, etc.).
package com.baizhi.jsy.kafka
import java.util.Properties
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.JSONKeyValueDeserializationSchema
object FlinkWordCountKafkaSourceJson {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream from the Kafka topic; `true` means the metadata is included
    val props = new Properties()
    props.setProperty("bootstrap.servers", "Centos:9092")
    props.setProperty("group.id", "g1")
    val text = env.addSource(new FlinkKafkaConsumer[ObjectNode]("topic01",new JSONKeyValueDeserializationSchema(true),props))
    //input record format: {"id":1,"name":"zhangsan"}
    //t: {"value":{"id":1,"name":"zhangsan"},"metadata":{"offset":0,"topic":"topic01","partition":13}}
    text.map(t=>(t.get("value").get("id").asInt(),t.get("value").get("name").asText())).print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}