What flatMap does in Flink: map and flatMap

Reposted

mob64ca13fc5fb6 2024-04-01 10:50:05


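Flink in one sentence:
At its core, Flink is a streaming dataflow execution engine that provides data distribution, data communication, and fault tolerance for distributed computation over data streams.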

WordCount source code

Before showing the source, here is some background.

First, the difference between map and flatMap

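map takes a function, applies it to every element of the collection, and returns the processed results, one output element per input element.
flatMap differs in that the function may produce a list of zero, one, or more elements for each input, and those per-element lists are flattened into a single output.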
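
The difference is easiest to see outside Flink first. The following is a minimal sketch using the JDK's java.util.stream API rather than Flink's DataStream API; the class name MapVsFlatMap and its helper methods are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapVsFlatMap {

    // map: exactly one output element per input element.
    public static List<Integer> lineLengths(List<String> lines) {
        return lines.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    // flatMap: each input element yields zero or more outputs,
    // and the per-element results are flattened into one sequence.
    public static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or not to be");
        System.out.println(lineLengths(lines)); // [5, 12]
        System.out.println(words(lines));       // [to, be, or, not, to, be]
    }
}
```

Two input lines give two lengths with map, but six words with flatMap, which is the same shape of transformation WordCount needs when it splits lines into words.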

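Next, a quick pass over the basic operators

map
The map function can be written as a lambda expression.
Put simply, it transforms one data stream into another.
The code below uses Kafka as the data source and converts a stream of strings into a stream of Student objects.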

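import com.alibaba.fastjson.JSON;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

import java.util.Properties;

public class kafkatoSink {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9093");
        props.put("zookeeper.connect", "localhost:2181");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        SingleOutputStreamOperator<Student> student = env.addSource(new FlinkKafkaConsumer010<>(
                "student",   // Kafka topic; must match the topic the producer utility class writes to
                new SimpleStringSchema(),
                props)).setParallelism(1)
                .map(string -> JSON.parseObject(string, Student.class)); // parse each string into a Student with Fastjson

        student.addSink(new SinkFunction<Student>() {
            @Override
            public void invoke(Student value) throws Exception {
                System.out.println(value.name + " " + value.age + " " + value.id);
            }
        }); // sink the data to the console

        env.execute("Flink add sink");
    }

}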

flatMap

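The flatMap method does not return its results directly; it emits zero or more elements through the Collector that is passed to it.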

public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");
        // emit the pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<String, Integer>(token, 1));
            }
        }
    }
}

filter
Example:
The lambda returns true to keep an element and false to drop it.

// Filter on the domain field, keeping only records whose domain is BAIDU
SingleOutputStreamOperator<UrlInfo> filterSource = flatSource.filter(
        urlInfo -> StringUtils.equals(UrlInfo.BAIDU, urlInfo.getDomain()));

filterSource.addSink(new PrintSinkFunction<>());

keyBy

DataStream -> KeyedStream

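All records with the same key are assigned to the same partition. Internally, keyBy() is implemented with hash partitioning. It is comparable to GROUP BY in SQL.

dataStream.keyBy(0) // use the first tuple field as the key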

public static void main(String [] args)throws Exception{
        //wordCount();
        StreamExecutionEnvironment env =  StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(Tuple2.of(2L,3L),Tuple2.of(1L,5L),Tuple2.of(1L, 7L), Tuple2.of(2L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0)
                // this MapFunction converts each Tuple2 into a String
                .map((MapFunction<Tuple2<Long,Long>,String>) tuple->
                        "Key:"+tuple.f0+",value:"+tuple.f1)
                .print();
        env.execute("execute");
    }

result:

6> Key:1,value:5
6> Key:1,value:7
6> Key:1,value:2
8> Key:2,value:3
8> Key:2,value:4

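The prefixes 6> and 8> identify the parallel subtask that printed each record. Because the stream is keyed by the first tuple field, all records with the same key end up in the same group and therefore on the same subtask.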
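
The grouping above can be pictured as plain hash partitioning. The sketch below is a simplified model in plain Java, not Flink code: the class KeyByPartitionSketch and the helper partitionFor are hypothetical, and the modulo on hashCode is an assumption made for illustration. Flink's real assignment goes through key groups and an additional hash, so the subtask numbers it prints will differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyByPartitionSketch {

    // Hypothetical helper: pick a partition from the key's hash code.
    public static int partitionFor(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Group (key, value) records by their target partition.
    public static Map<Integer, List<long[]>> partition(long[][] records, int parallelism) {
        Map<Integer, List<long[]>> byPartition = new TreeMap<>();
        for (long[] record : records) {
            int p = partitionFor(Long.valueOf(record[0]), parallelism);
            byPartition.computeIfAbsent(p, k -> new ArrayList<>()).add(record);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        long[][] records = {{2, 3}, {1, 5}, {1, 7}, {2, 4}, {1, 2}};
        // Records sharing a key always print under the same partition prefix.
        partition(records, 8).forEach((p, recs) -> {
            for (long[] r : recs) {
                System.out.println(p + "> Key:" + r[0] + ",value:" + r[1]);
            }
        });
    }
}
```

The point of the sketch is only the invariant: the partition is a pure function of the key, so equal keys can never be split across partitions.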

reduce

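reduce merges data into a new value and returns a single result; every element that reduce processes produces a new value.

reduce therefore has to run on a group or on a window, that is, on data that has already been through keyBy or window/timeWindow. Following the ReduceFunction, each element is combined with the previous reduced result, and the combined result is emitted.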

public static void main(String [] args)throws Exception{
        //wordCount();
        StreamExecutionEnvironment env =  StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(Tuple2.of(2L,3L),Tuple2.of(1L,5L),Tuple2.of(1L, 7L), Tuple2.of(2L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0)
                // this ReduceFunction keeps the key and adds up the values of records with the same key
                .reduce((ReduceFunction<Tuple2<Long,Long>>)(t2,t1) -> new Tuple2<>(t1.f0,t1.f1+t2.f1))
                .print();

        env.execute("execute");
    }

8> (2,3)
8> (2,7)
6> (1,5)
6> (1,12)
6> (1,14)
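
The running results above can be reproduced with a small plain-Java model of a keyed rolling reduce. This is a sketch under assumptions, not Flink's implementation: RollingReduceSketch and rollingReduce are hypothetical names, and a HashMap stands in for Flink's keyed state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class RollingReduceSketch {

    // For each incoming (key, value) record, combine it with the previous
    // reduced value for that key and emit the new running result.
    public static List<String> rollingReduce(long[][] records, BinaryOperator<Long> reducer) {
        Map<Long, Long> state = new HashMap<>();   // last reduced value per key
        List<String> emitted = new ArrayList<>();
        for (long[] record : records) {
            long key = record[0];
            long value = record[1];
            Long reduced = state.merge(key, value, reducer);
            emitted.add("(" + key + "," + reduced + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[][] records = {{2, 3}, {1, 5}, {1, 7}, {2, 4}, {1, 2}};
        System.out.println(rollingReduce(records, Long::sum));
        // [(2,3), (1,5), (1,12), (2,7), (1,14)]
    }
}
```

The per-key sequences match the Flink output (3 then 7 for key 2; 5, 12, 14 for key 1). Only the interleaving differs, because this model processes records in input order on one thread while Flink prints from separate subtasks.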

Aggregate

KeyedStream -> DataStream

public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // keyBy(0) keys the stream by the first tuple field;
        // Tuple here is org.apache.flink.api.java.tuple.Tuple
        KeyedStream<Tuple2<Long, Long>, Tuple> keyedStream = env
                .fromElements(Tuple2.of(2L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(2L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0);
        // sum(1) keeps a running sum of the second field for each key
        SingleOutputStreamOperator<Tuple2<Long, Long>> sumStream = keyedStream.sum(1);
        sumStream.addSink(new PrintSinkFunction<>());
        env.execute("execute");
    }

For min, minBy, and the other aggregations, see the Flink documentation.

Now for the WordCount source

package wordCount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class wordCount {

    public static void main(String [] args) throws Exception{

        if (args.length != 2) {
            System.err.println("Usage: wordCount <hostname> <port>");
            return;
        }
        String hostname = args[0];
        int port = Integer.parseInt(args[1]);

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> text = env.socketTextStream(hostname,port);

        text.flatMap(new LineSplitter()).setParallelism(1)
                .keyBy(0)
                .sum(1).setParallelism(1)
                .print();
        env.execute("wordCount SocketTextStream");
    }


    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String,Integer>>{
        @Override
        public void flatMap(String value, Collector<Tuple2<String,Integer>> out){

            String [] tokens = value.toLowerCase().split("\\W+");

            for(String token:tokens){
                if(token.length()>0){
                    out.collect(new Tuple2<>(token,1));
                }

            }

        }
    }
}

In a terminal, run

nc -l 9000

and then start the program, passing the hostname and port as its two arguments.


Flink's graph structures

StreamGraph: the topology of the program, generated directly from the user code.


JobGraph: the graph handed to Flink, from which tasks are generated.


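When data flows from one operator to another, the upstream side acts as the producer and provides an IntermediateDataSet, while the downstream side acts as the consumer and needs a JobEdge. A JobEdge is a communication channel that connects the data set produced upstream with the downstream JobVertex.

ExecutionGraph: the layer that is actually executed.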


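During the conversion from JobGraph to ExecutionGraph:

Parallelism is introduced, turning the graph into a structure that can actually be scheduled.
ExecutionJobVertex and ExecutionVertex are generated for each JobVertex, and IntermediateResult and IntermediateResultPartition for each IntermediateDataSet, among others.
The ExecutionGraph is used to schedule work: Flink generates tasks from it one-to-one, and each task corresponds to an Execution.

To summarize:
The StreamGraph is a mapping of the user's logic.

The JobGraph builds on it with some optimizations, for example linking some operations into a chain to improve efficiency.

The ExecutionGraph exists for scheduling, and it adds the notion of parallel execution.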

Generating the StreamGraph

Source path: /org/apache/flink/streaming/api/graph/StreamGraphGenerator.java

Flink turns each operator into a transformation on a stream and registers it with the execution environment.

public StreamGraphGenerator(List<Transformation<?>> transformations, 
	ExecutionConfig executionConfig, CheckpointConfig checkpointConfig) {

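The StreamTransformation class (named Transformation in newer Flink versions, as in the constructor above) represents a transformation of a stream.

In essence, it is an operation that produces a new DataStream from one or more DataStreams.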


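First, the UDF defined in user code is treated as its base type and handed to the StreamMap operator for further wrapping. A transformation that does actual processing, such as map, corresponds to a StreamOperator.
Because map accepts only one input, it is then wrapped as a OneInputTransformation.
Finally, the transformation is registered with the execution environment, and the StreamGraph is built when the generator's generate method runs.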

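Generating the JobGraph

Flink generates the JobGraph from the StreamGraph produced in the previous step, then sends the JobGraph to the server side, where it is turned into the ExecutionGraph.

Source for JobGraph generation: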
/home/alex/.m2/repository/org/apache/flink/flink-streaming-java_2.11/1.6.0/flink-streaming-java_2.11-1.6.0-sources.jar!/org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java

private JobGraph createJobGraph() {

		// make sure that all vertices start immediately
		jobGraph.setScheduleMode(ScheduleMode.EAGER);

		// Generate deterministic hashes for the nodes in order to identify them across
		// submission iff they didn't change.
		Map<Integer, byte[]> hashes = defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);

		// Generate legacy version hashes for backwards compatibility
		List<Map<Integer, byte[]>> legacyHashes = new ArrayList<>(legacyStreamGraphHashers.size());
		for (StreamGraphHasher hasher : legacyStreamGraphHashers) {
			legacyHashes.add(hasher.traverseStreamGraphAndGenerateHashes(streamGraph));
		}

		Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes = new HashMap<>();

		// create the job vertices and link chainable operators into chains
		setChaining(hashes, legacyHashes, chainedOperatorHashes);

		setPhysicalEdges();

		setSlotSharingAndCoLocation();

		configureCheckpointing();

		JobGraphGenerator.addUserArtifactEntries(streamGraph.getEnvironment().getCachedFiles(), jobGraph);

		// set the ExecutionConfig last when it has been finalized
		try {
			jobGraph.setExecutionConfig(streamGraph.getExecutionConfig());
		}
		catch (IOException e) {
			throw new IllegalConfigurationException("Could not serialize the ExecutionConfig." +
					"This indicates that non-serializable types (like custom serializers) were registered");
		}

		return jobGraph;
	}

Let's look at how setChaining is implemented.

private void setChaining(Map<Integer, byte[]> hashes, List<Map<Integer, byte[]>> legacyHashes, Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes) {
		for (Integer sourceNodeId : streamGraph.getSourceIDs()) {
			createChain(sourceNodeId, sourceNodeId, hashes, legacyHashes, 0, chainedOperatorHashes);
		}
	}

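The general idea:
Traverse the nodes; if a node is the head of a chain, a JobVertex can be created for it.
If it is not a head node, merge its configuration into the head node, then connect the head node to this node's out-edges.
A node that cannot be chained is treated as a chain consisting only of a head node.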

private List<StreamEdge> createChain(
			Integer startNodeId,
			Integer currentNodeId,
			Map<Integer, byte[]> hashes,
			List<Map<Integer, byte[]>> legacyHashes,
			int chainIndex,
			Map<Integer, List<Tuple2<byte[], byte[]>>> chainedOperatorHashes) {

		// only process nodes that have not been built yet
		if (!builtVertices.contains(startNodeId)) {
			
			List<StreamEdge> transitiveOutEdges = new ArrayList<StreamEdge>();
			List<StreamEdge> chainableOutputs = new ArrayList<StreamEdge>();
			List<StreamEdge> nonChainableOutputs = new ArrayList<StreamEdge>();

			// first scan all out-edges and decide whether each can be linked into the chain
			for (StreamEdge outEdge : streamGraph.getStreamNode(currentNodeId).getOutEdges()) {
				if (isChainable(outEdge, streamGraph)) {
					chainableOutputs.add(outEdge);
				} else {
					nonChainableOutputs.add(outEdge);
				}
			}


			// recurse: chainable edges extend the current chain, while each
			// non-chainable edge starts a new chain at its target node
			for (StreamEdge chainable : chainableOutputs) {
				transitiveOutEdges.addAll(
						createChain(startNodeId, chainable.getTargetId(), hashes, legacyHashes, chainIndex + 1, chainedOperatorHashes));
			}

			for (StreamEdge nonChainable : nonChainableOutputs) {
				transitiveOutEdges.add(nonChainable);
				createChain(nonChainable.getTargetId(), nonChainable.getTargetId(), hashes, legacyHashes, 0, chainedOperatorHashes);
			}

			List<Tuple2<byte[], byte[]>> operatorHashes =
				chainedOperatorHashes.computeIfAbsent(startNodeId, k -> new ArrayList<>());

			byte[] primaryHashBytes = hashes.get(currentNodeId);

			for (Map<Integer, byte[]> legacyHash : legacyHashes) {
				operatorHashes.add(new Tuple2<>(primaryHashBytes, legacyHash.get(currentNodeId)));
			}

			chainedNames.put(currentNodeId, createChainedName(currentNodeId, chainableOutputs));
			chainedMinResources.put(currentNodeId, createChainedMinResources(currentNodeId, chainableOutputs));
			chainedPreferredResources.put(currentNodeId, createChainedPreferredResources(currentNodeId, chainableOutputs));

			StreamConfig config = currentNodeId.equals(startNodeId)
					? createJobVertex(startNodeId, hashes, legacyHashes, chainedOperatorHashes)
					: new StreamConfig(new Configuration());

			setVertexConfig(currentNodeId, config, chainableOutputs, nonChainableOutputs);

			if (currentNodeId.equals(startNodeId)) {

				config.setChainStart();
				config.setChainIndex(0);
				config.setOperatorName(streamGraph.getStreamNode(currentNodeId).getOperatorName());
				config.setOutEdgesInOrder(transitiveOutEdges);
				config.setOutEdges(streamGraph.getStreamNode(currentNodeId).getOutEdges());

				for (StreamEdge edge : transitiveOutEdges) {
					connect(startNodeId, edge);
				}

				config.setTransitiveChainedTaskConfigs(chainedConfigs.get(startNodeId));

			} else {

				Map<Integer, StreamConfig> chainedConfs = chainedConfigs.get(startNodeId);

				if (chainedConfs == null) {
					chainedConfigs.put(startNodeId, new HashMap<Integer, StreamConfig>());
				}
				config.setChainIndex(chainIndex);
				StreamNode node = streamGraph.getStreamNode(currentNodeId);
				config.setOperatorName(node.getOperatorName());
				chainedConfigs.get(startNodeId).put(currentNodeId, config);
			}

			config.setOperatorID(new OperatorID(primaryHashBytes));

			if (chainableOutputs.isEmpty()) {
				config.setChainEnd();
			}
			return transitiveOutEdges;

		} else {
			return new ArrayList<>();
		}
	}
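
The recursion is easier to follow with everything except the chaining decision stripped away. The following is a toy sketch, not Flink code: ChainSketch, Edge, and the precomputed chainable flag are all hypothetical simplifications (real Flink derives chainability in isChainable from conditions such as parallelism and partitioner), and hashes, configs, and resources are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ChainSketch {

    // Hypothetical simplified edge with a precomputed chainable flag.
    public static final class Edge {
        final int source;
        final int target;
        final boolean chainable;

        public Edge(int source, int target, boolean chainable) {
            this.source = source;
            this.target = target;
            this.chainable = chainable;
        }
    }

    private final List<Edge> edges;
    private final Set<Integer> built = new HashSet<>();
    // chain head id -> ids of all operators merged into that job vertex
    private final Map<Integer, List<Integer>> chains = new LinkedHashMap<>();

    public ChainSketch(List<Edge> edges) {
        this.edges = edges;
    }

    // Mirrors setChaining: start one chain at every source node.
    public Map<Integer, List<Integer>> build(List<Integer> sourceIds) {
        for (int sourceId : sourceIds) {
            createChain(sourceId, sourceId);
        }
        return chains;
    }

    private void createChain(int startNodeId, int currentNodeId) {
        if (currentNodeId == startNodeId && !built.add(startNodeId)) {
            return; // this chain head was already processed
        }
        chains.computeIfAbsent(startNodeId, k -> new ArrayList<>()).add(currentNodeId);
        for (Edge edge : edges) {
            if (edge.source != currentNodeId) {
                continue;
            }
            if (edge.chainable) {
                createChain(startNodeId, edge.target);   // extend the current chain
            } else {
                createChain(edge.target, edge.target);   // the target heads a new chain
            }
        }
    }

    public static void main(String[] args) {
        // 1 = source, 2 = flatMap, 3 = keyed aggregation, 4 = sink (illustrative ids)
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(1, 2, false));
        edges.add(new Edge(2, 3, false));
        edges.add(new Edge(3, 4, true));
        List<Integer> sources = new ArrayList<>();
        sources.add(1);
        System.out.println(new ChainSketch(edges).build(sources));
        // {1=[1], 2=[2], 3=[3, 4]}
    }
}
```

Each map entry corresponds to one job vertex: nodes 1 and 2 stay alone, while nodes 3 and 4 are merged under head 3.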

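When learning a new framework, it is worth asking what advantages it has over similar products: what does Flink offer compared with computing frameworks such as Spark?

It turns out that, to execute more efficiently in a distributed setting, Flink links operator subtasks together into one larger task wherever it can.

A task runs in a single thread, which has these benefits:

Fewer switches between threads
Less serialization and deserialization of messages
Less data exchanged through buffers
Lower latency together with higher overall throughput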


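In this example the key agg and sink operators are merged. Merging must not change the overall topology.

To summarize: by linking operators into a single task, Flink avoids the cost of serializing data and sending it over the network to other nodes, which improves efficiency considerably.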
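
As a rough analogy for what chaining buys, two chained operators behave like two composed functions running in one thread. The sketch below uses only the JDK's java.util.function.Function; OperatorFusionSketch and the two toy "operators" are invented for illustration and say nothing about how Flink wires chained operators internally.

```java
import java.util.function.Function;

public class OperatorFusionSketch {

    public static Function<String, String> chain() {
        Function<String, String> trim = String::trim;           // first "operator"
        Function<String, String> upper = String::toUpperCase;   // second "operator"
        // Chained: the result of trim is handed directly to upper in the
        // same thread, with no serialization or buffer in between.
        return trim.andThen(upper);
    }

    public static void main(String[] args) {
        System.out.println(chain().apply("  flink ")); // FLINK
    }
}
```

Without chaining, the intermediate value would instead be serialized, placed in a buffer, and picked up by another thread or another machine.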

Submitting the JobGraph relies on asynchronous communication between the JobClient and the JobManager.


Generating the ExecutionGraph:

It is generated on the server side, not in the client program.

A Flink cluster maintains only one JobManager.


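When a Flink cluster starts, it launches one JobManager and several TaskManagers.

User code is submitted to the JobManager, which then dispatches the tasks coming from different users to different TaskManagers for execution.

Each TaskManager manages multiple tasks; a task is the smallest unit that performs computation. The TaskManager is also responsible for reporting heartbeats and statistics to the JobManager.

TaskManagers exchange data with each other as streams.

All of these managers are separate JVM processes.

TaskManagers and jobs are not in a one-to-one relationship: the smallest unit Flink schedules is the task, not the TaskManager.

In other words, tasks from different jobs may run on different threads of the same TaskManager.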

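Scheduling compute resources

A task slot is the smallest unit of resource allocation inside a TaskManager and represents a fixed-size subset of its resources; each TaskManager divides the resources it holds equally among its slots.

If every TaskManager has one slot, each task runs in its own JVM.

If a TaskManager has several slots, several tasks run in the same JVM.

Tasks in the same JVM process can share TCP connections and heartbeat messages, which reduces network traffic, and they can also share some data structures. This lowers the overhead per task.

Each slot can hold a single task or a pipeline made up of several consecutive tasks. In this example the flatMap function occupies one task slot, while the key agg function and the sink function share one.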
