Wikipedia provides an IRC channel where all edits to the wiki are logged. In this example, Flink reads that channel and counts the number of bytes each user edits within a given time window. It is a very simple stream-analysis application, but it is a good starting point for building more sophisticated stream processing on top of it.

Setting up a Maven Project

Use the Flink Maven Archetype to create the project skeleton with the following command:

$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeVersion=1.6.1 \
    -DgroupId=wiki-edits \
    -DartifactId=wiki-edits \
    -Dversion=0.1 \
    -Dpackage=wikiedits \
    -DinteractiveMode=false

You can change the groupId, artifactId, and package to whatever you like. The resulting project structure looks like this:

$ tree wiki-edits
wiki-edits/
├── pom.xml
└── src
    └── main
        ├── java
        │   └── wikiedits
        │       ├── BatchJob.java
        │       ├── SocketTextStreamWordCount.java
        │       ├── StreamingJob.java
        │       └── WordCount.java
        └── resources
            └── log4j.properties

The generated pom.xml already contains the required Flink dependencies, and there are several example classes under src/main/java. Since we do not need them, you can delete the example classes:

$ rm wiki-edits/src/main/java/wikiedits/*.java

Finally, add the Flink Wikipedia connector as a dependency in the pom.xml. The dependencies section should then look like this:

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-wikiedits_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>

Notice the flink-connector-wikiedits_2.11 dependency that was just added; the other dependencies were generated by the archetype.

Writing a Flink Program

Open your IDE, import the Maven project, and create a file with the following contents:

package wikiedits;

public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {

    }
}

The program skeleton is very basic for now; we will fill it in step by step. Import statements are not shown along the way because the IDE adds them automatically. If you want to skip ahead, the complete code, including the imports, is given at the end of this section.

The first step in a Flink program is to create a StreamExecutionEnvironment (or an ExecutionEnvironment if you are writing a batch job). It can be used to set execution parameters and to create sources that read from external systems. Add the following to the main method:

StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
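For example, execution parameters can be set directly on this environment. The line below is only an illustration; the parallelism of 1 is an arbitrary value chosen for readable console output, not something this example requires:

// Optional: execution parameters can be configured on the environment.
// Parallelism 1 keeps the printed results in a single output stream (illustrative value).
see.setParallelism(1);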

Next, create a source that reads from the Wikipedia IRC log:

DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

This creates a DataStream of WikipediaEditEvent elements that we can process further. In this example we want to count the number of bytes each user adds or removes within a fixed time window, say five seconds. To do that, we first have to key the stream by user name, so that all downstream operations are performed per user; the byte count per window is then computed per unique user. We therefore need to provide a key selector, like this:

KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
    .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
            return event.getUser();
        }
    });

The resulting stream now has a key of type String, the user name. We can now apply windowed computations: aggregating over an unbounded stream requires a window, and here we count the edited bytes in windows of five seconds.

DataStream<Tuple2<String, Long>> result = keyedEdits
    .timeWindow(Time.seconds(5))
    .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
            acc.f0 = event.getUser();
            acc.f1 += event.getByteDiff();
            return acc;
        }
    });

The first call, .timeWindow(), specifies tumbling (non-overlapping) windows of five seconds. The second call, .fold(), applies a fold transformation to the elements of each window for each unique key: starting from the initial value ("", 0L), it adds up the byte diff of every edit a user makes during the window. The resulting stream contains one Tuple2<String, Long> per user, emitted every five seconds. All that is left is to print the result and run the program:

result.print();

see.execute();

The last call to execute() is required: the earlier calls that create sources, transformations, and sinks only build up an internal graph of operations, and only when execute() is called is that graph actually run on your local machine or submitted to a cluster.
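As a side note, fold() and FoldFunction are deprecated in newer Flink releases in favor of AggregateFunction. A roughly equivalent version of the windowed computation, written as a sketch against the same DataStream API (it additionally requires import org.apache.flink.api.common.functions.AggregateFunction), would look like this:

DataStream<Tuple2<String, Long>> result = keyedEdits
    .timeWindow(Time.seconds(5))
    .aggregate(new AggregateFunction<WikipediaEditEvent, Tuple2<String, Long>, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> createAccumulator() {
            // Start with an empty user name and a byte count of zero.
            return new Tuple2<>("", 0L);
        }

        @Override
        public Tuple2<String, Long> add(WikipediaEditEvent event, Tuple2<String, Long> acc) {
            // Remember the user and add the byte diff of this edit.
            return new Tuple2<>(event.getUser(), acc.f1 + event.getByteDiff());
        }

        @Override
        public Tuple2<String, Long> getResult(Tuple2<String, Long> acc) {
            return acc;
        }

        @Override
        public Tuple2<String, Long> merge(Tuple2<String, Long> a, Tuple2<String, Long> b) {
            // Only used for merging windows; combine the byte counts.
            return new Tuple2<>(a.f0, a.f1 + b.f1);
        }
    });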

The complete code:

package wikiedits;

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

public class WikipediaAnalysis {

  public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

    KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
      .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
          return event.getUser();
        }
      });

    DataStream<Tuple2<String, Long>> result = keyedEdits
      .timeWindow(Time.seconds(5))
      .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
          acc.f0 = event.getUser();
          acc.f1 += event.getByteDiff();
          return acc;
        }
      });

    result.print();

    see.execute();
  }
}

You can run this example in your IDE, or from the command line using Maven:

$ mvn clean package
$ mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis

The output should look similar to this:

1> (Fenix down,114)
6> (AnomieBOT,155)
8> (BD2412bot,-3690)
7> (IgnorantArmies,49)
3> (Ckh3111,69)
5> (Slade360,0)
7> (Narutolovehinata5,2195)
6> (Vuyisa2001,79)
4> (Ms Sarah Welch,269)
4> (KasparBot,-245)

Bonus exercise: writing to Kafka

First, add the Kafka connector dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

Then, modify the program:

result
    .map(new MapFunction<Tuple2<String,Long>, String>() {
        @Override
        public String map(Tuple2<String, Long> tuple) {
            return tuple.toString();
        }
    })
    .addSink(new FlinkKafkaProducer08<>("localhost:9092", "wiki-result", new SimpleStringSchema()));
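The constructor used above takes the broker list, the topic name, and a serialization schema. If you need to pass additional producer settings, the connector should also offer a constructor variant that takes a java.util.Properties object instead of the plain broker list; the following is only a sketch, so check the constructor signatures of your connector version:

// Assumes the (topic, schema, properties) constructor variant of FlinkKafkaProducer08.
// Additionally requires: import java.util.Properties;
Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", "localhost:9092");
// Further Kafka producer settings can be added to the same Properties object.

result
    .map(new MapFunction<Tuple2<String, Long>, String>() {
        @Override
        public String map(Tuple2<String, Long> tuple) {
            return tuple.toString();
        }
    })
    .addSink(new FlinkKafkaProducer08<>("wiki-result", new SimpleStringSchema(), producerProps));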

The related classes also need to be imported:

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer08;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.functions.MapFunction;

The stream is first mapped from Tuple2<String, Long> to String because plain strings are easier to write to Kafka. Also make sure to change the Kafka broker address ("localhost:9092") to match your own setup. Next, build the package:

$ mvn clean package

Start a local Flink cluster:

$ cd my/flink/directory
$ bin/start-cluster.sh

Then create a Kafka topic so that the program can write messages to it:

$ cd my/kafka/directory
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wiki-result

Run the program jar:

$ cd my/flink/directory
$ bin/flink run -c wikiedits.WikipediaAnalysis path/to/wikiedits-0.1.jar

The output should look similar to this:

03/08/2016 15:09:27 Job execution switched to status RUNNING.
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to SCHEDULED
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to DEPLOYING
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to SCHEDULED
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to DEPLOYING
03/08/2016 15:09:27 TriggerWindow(TumblingProcessingTimeWindows(5000), FoldingStateDescriptor{name=window-contents, defaultValue=(,0), serializer=null}, ProcessingTimeTrigger(), WindowedStream.fold(WindowedStream.java:207)) -> Map -> Sink: Unnamed(1/1) switched to RUNNING
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to RUNNING

You can inspect the messages in the Kafka topic with the console consumer:

$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wiki-result

You can also monitor the running job in the Flink dashboard at http://localhost:8081.


Click on the running job to see more details about it.
