Wikipedia provides an IRC channel to which all edits are logged. In this example, Flink reads that channel and counts the number of bytes each user edits within a given time window. It is a very simple streaming analysis application, but it can serve as the starting point for more complex stream processing.
Setting up a Maven Project
Use the Flink Maven Archetype to create the project skeleton with the following command:
$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeVersion=1.6.1 \
    -DgroupId=wiki-edits \
    -DartifactId=wiki-edits \
    -Dversion=0.1 \
    -Dpackage=wikiedits \
    -DinteractiveMode=false
You can change groupId, artifactId, and package to whatever you like. The generated project structure looks like this:
$ tree wiki-edits
wiki-edits/
├── pom.xml
└── src
    └── main
        ├── java
        │   └── wikiedits
        │       ├── BatchJob.java
        │       ├── SocketTextStreamWordCount.java
        │       ├── StreamingJob.java
        │       └── WordCount.java
        └── resources
            └── log4j.properties
The generated pom.xml already contains the required Flink dependencies, and there are several example programs under src/main/java. You can delete the examples we will not use:
$ rm wiki-edits/src/main/java/wikiedits/*.java
Finally, add the Flink Wikipedia connector as a dependency in the pom.xml, so that the dependencies section looks like this:
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-wikiedits_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>
Note the flink-connector-wikiedits_2.11 dependency that was added; the other three are generated by the archetype.
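The snippets above reference a ${flink.version} property. In a project generated from the archetype this property is normally already defined in the <properties> section of the same pom.xml; as a minimal sketch (the generated file may contain additional properties, for example for the Scala binary version), it should contain something like:

<properties>
    <!-- must match the archetypeVersion used when generating the project -->
    <flink.version>1.6.1</flink.version>
</properties>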
Writing a Flink Program
Open your IDE, import the Maven project, and create the file src/main/java/wikiedits/WikipediaAnalysis.java with the following skeleton:
package wikiedits;

public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {

    }
}
The program is a bare skeleton for now; we will fill it in step by step. Import statements are not shown in the intermediate snippets, since your IDE can add them automatically. The complete code, including imports, is given at the end of this section if you want to skip ahead.
The first step in a Flink program is to create a StreamExecutionEnvironment (or an ExecutionEnvironment for batch programs). It is used to set execution parameters and to create sources that read from external systems. Add the following to the main method:
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Next, create a source that reads from the Wikipedia IRC log:
DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());
This creates a DataStream of WikipediaEditEvent elements that we can process further. In this example we want to determine, for each user, how many bytes are added or removed within a fixed time window, say five seconds. To do that, the stream first has to be keyed by user name, so that the aggregation is computed per user. This requires a key selector:
KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
    .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
            return event.getUser();
        }
    });
This gives us a stream keyed by a String, the user name, on which we can now compute windowed aggregates. Since the stream is unbounded, aggregations must be scoped to a window; here we sum the edited bytes over windows of five seconds:
DataStream<Tuple2<String, Long>> result = keyedEdits
    .timeWindow(Time.seconds(5))
    .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
            acc.f0 = event.getUser();
            acc.f1 += event.getByteDiff();
            return acc;
        }
    });
The first call, .timeWindow(), specifies tumbling (non-overlapping) windows of five seconds. The second call, .fold(), applies a fold transformation to each unique key within each window slice: it starts from the initial value ("", 0L) and adds the byte diff of every edit the user makes within that window. The resulting stream contains one Tuple2<String, Long> per user, emitted every five seconds. Finally, print the result stream and start execution:
result.print();
see.execute();
This last call is necessary: all the previous calls only build up an internal graph of operations, and nothing actually runs until execute() is called.
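As a side note, fold() on windowed streams is deprecated in recent Flink releases in favor of aggregate(). The following is a hedged sketch, not part of the original example, of how the same per-user byte count could be expressed with an AggregateFunction (assuming Flink 1.4 or later, where add() returns the accumulator):

// requires: import org.apache.flink.api.common.functions.AggregateFunction;
DataStream<Tuple2<String, Long>> result = keyedEdits
    .timeWindow(Time.seconds(5))
    .aggregate(new AggregateFunction<WikipediaEditEvent, Tuple2<String, Long>, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> createAccumulator() {
            // every window starts with an empty user name and a zero byte count
            return new Tuple2<>("", 0L);
        }

        @Override
        public Tuple2<String, Long> add(WikipediaEditEvent event, Tuple2<String, Long> acc) {
            // remember the user for this key and accumulate the byte diff
            return new Tuple2<>(event.getUser(), acc.f1 + event.getByteDiff());
        }

        @Override
        public Tuple2<String, Long> getResult(Tuple2<String, Long> acc) {
            return acc;
        }

        @Override
        public Tuple2<String, Long> merge(Tuple2<String, Long> a, Tuple2<String, Long> b) {
            // only relevant for merging windows; combine the partial counts
            return new Tuple2<>(a.f0, a.f1 + b.f1);
        }
    });

The rest of this walkthrough keeps the fold() version.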
The complete code so far:
package wikiedits;
import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

        KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
            .keyBy(new KeySelector<WikipediaEditEvent, String>() {
                @Override
                public String getKey(WikipediaEditEvent event) {
                    return event.getUser();
                }
            });

        DataStream<Tuple2<String, Long>> result = keyedEdits
            .timeWindow(Time.seconds(5))
            .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
                    acc.f0 = event.getUser();
                    acc.f1 += event.getByteDiff();
                    return acc;
                }
            });

        result.print();

        see.execute();
    }
}
You can run this example in your IDE, or from the command line using Maven:
$ mvn clean package
$ mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis
The output should look similar to the following; the number in front of each line indicates which parallel instance of the print sink produced it:
1> (Fenix down,114)
6> (AnomieBOT,155)
8> (BD2412bot,-3690)
7> (IgnorantArmies,49)
3> (Ckh3111,69)
5> (Slade360,0)
7> (Narutolovehinata5,2195)
6> (Vuyisa2001,79)
4> (Ms Sarah Welch,269)
4> (KasparBot,-245)
Bonus Exercise: Writing to Kafka
First, add the Kafka connector dependency to the pom.xml:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
Then modify the program so that the result stream is written to Kafka:
result
    .map(new MapFunction<Tuple2<String, Long>, String>() {
        @Override
        public String map(Tuple2<String, Long> tuple) {
            return tuple.toString();
        }
    })
    .addSink(new FlinkKafkaProducer08<>("localhost:9092", "wiki-result", new SimpleStringSchema()));
The corresponding classes also need to be imported:
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer08;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.functions.MapFunction;
We first map the stream from Tuple2<String, Long> to String because plain strings are easier to write to Kafka. Remember to adjust the Kafka broker address and topic to your own setup. Next, package the project:
$ mvn clean package
Start a local Flink cluster:
$ cd my/flink/directory
$ bin/start-cluster.sh
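The Kafka sink needs a running broker to connect to. If you do not have one yet, the commands below are a minimal sketch for starting ZooKeeper and a single Kafka broker from a standard Kafka distribution (directory and config file names assume the stock Kafka quickstart layout):

$ cd my/kafka/directory
$ bin/zookeeper-server-start.sh config/zookeeper.properties &
$ bin/kafka-server-start.sh config/server.properties &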
Then create the Kafka topic that the program will write to (the name must match the one used in the program, wiki-result):
$ cd my/kafka/directory
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wiki-result
Then run the program jar on the cluster:
$ cd my/flink/directory
$ bin/flink run -c wikiedits.WikipediaAnalysis path/to/wikiedits-0.1.jar
The output of the job submission looks similar to this:
03/08/2016 15:09:27 Job execution switched to status RUNNING.
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to SCHEDULED
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to DEPLOYING
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to SCHEDULED
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to DEPLOYING
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to RUNNING
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to RUNNING
You can inspect the messages in the Kafka topic with the console consumer:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wiki-result
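Depending on your Kafka version, the console consumer may no longer accept the --zookeeper flag; in that case connect to the broker directly. A hedged alternative, assuming the broker runs on localhost:9092:

$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wiki-result --from-beginning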
You can also see the running job in the Flink dashboard at http://localhost:8081.
Click on the running job there to see detailed information about it.