1.9. Flink Getting-Started Example: WordCount
1.9.1. Development tools
1.9.2. Writing the Java word-count program
1.9.2.1. Adding the Flink Maven dependencies
1.9.2.2. Writing the WordCount Java code
1.9.2.3. Preparing the data
1.9.2.4. Execution results
1.9.3. Writing the Scala word-count program
1.9.3.1. Adding the Flink dependencies
1.9.3.2. Writing the WordCount Scala program
1.9.4. Flink StreamingWindowWordCount
1.9.5. Flink program development steps
1.9.6. Differences between Flink Streaming and Batch (at the API level)
1.9.7. Running the program on a cluster
1.9.8. Build configuration needed in the pom file when running the jar on a cluster

1.9. Flink Getting-Started Example: WordCount

1.9.1. Development tools

The official recommendation is IntelliJ IDEA, since it ships with Scala and Maven integration and is more convenient to use.
Flink programs can be developed in either Java or Scala.
Personally I suggest Scala: the implementation is more concise, while writing functional-style code in Java is comparatively awkward.
It is also advisable to use a domestic (China) Maven mirror, such as Aliyun's, because downloads from the default overseas repository are slow.
Note: if the domestic mirror reports that a dependency cannot be found, switch back to the default repository.
The mirror configuration is shown in the remarks.
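As a sketch of such a mirror configuration (the URL below is Aliyun's public mirror endpoint; verify it against Aliyun's current documentation before relying on it), the `<mirrors>` section of `~/.m2/settings.xml` would look roughly like:

```xml
<!-- ~/.m2/settings.xml -->
<settings>
  <mirrors>
    <!-- route requests for Maven Central through Aliyun's public mirror -->
    <mirror>
      <id>aliyun-public</id>
      <mirrorOf>central</mirrorOf>
      <name>Aliyun public mirror</name>
      <url>https://maven.aliyun.com/repository/public</url>
    </mirror>
  </mirrors>
</settings>
```

With `mirrorOf` set to `central`, only requests for the central repository are redirected; remove the mirror (or set a narrower `mirrorOf`) if a dependency cannot be resolved through it.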

1.9.2. Writing the Java word-count program

Create a new Flink IDEA project:

(screenshot of the new project structure omitted)

1.9.2.1. Adding the Flink Maven dependencies

Adding the following dependencies to pom.xml is enough to use Apache Flink in a project. They include a local execution environment, so they support local testing.
Scala API: to use the Scala API, replace the flink-java artifact id with flink-scala_2.12, and flink-streaming-java_2.12 with flink-streaming-scala_2.12 (the suffix is the Scala binary version).

The full pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.toto.learn</groupId>
    <artifactId>flink-learn</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.11.1</version>
        </dependency>

        <!--
        1. compile  : the default scope; needed at runtime and packaged into the jar.
        2. provided : needed at compile time only; the runtime supplies it, so it is not packaged.
        3. runtime  : not needed for compilation, only at runtime; packaged into the jar (separates interface from implementation).
        4. test     : needed only for tests; not packaged.
        5. system   : a jar taken from a path on the local system instead of a repository (rarely used).
        -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>1.11.1</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.12</artifactId>
            <version>1.11.1</version>
        </dependency>

        <!-- dependencies for programming in Scala: start -->
        <!--
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.12</artifactId>
            <version>1.11.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.12</artifactId>
            <version>1.11.1</version>
        </dependency>
        -->
        <!-- dependencies for programming in Scala: end -->

    </dependencies>

    <build>
        <plugins>
            <!-- compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Scala compiler plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.6</version>
                <configuration>
                    <!-- must match the _2.12 suffix of the Flink artifacts above -->
                    <scalaCompatVersion>2.12</scalaCompatVersion>
                    <scalaVersion>2.12.12</scalaVersion>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- jar-packaging plugin (bundles all dependencies) -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- optionally set the jar's main class -->
                            <mainClass>com.toto.learn.batch.BatchWordCountJava</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>
1.9.2.2. Writing the WordCount Java code
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * @author tuzuoquan
 * @version 1.0
 * @ClassName BatchWordCountJava
 * @description TODO
 * @date 2020/9/11 14:03
 **/
public class BatchWordCountJava {

    public static void main(String[] args) throws Exception {
        String inputPath = "E:/workspace/wordcount/input";
        String outputPath = "E:/workspace/wordcount/output";

        // Obtain the batch execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Read the contents of the input files
        DataSource<String> text = env.readTextFile(inputPath);

        // groupBy(n): group by the n-th field; sum(n): sum the values of that field within each group
        DataSet<Tuple2<String,Integer>> counts = text.flatMap(new Tokenizer()).groupBy(0).sum(1);
        counts.writeAsCsv(outputPath,"\n"," ").setParallelism(1);
        env.execute("batch word count");
    }

    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String,Integer>>{

        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
            // \w matches a letter, digit, or underscore, i.e. [A-Za-z0-9_]
            // \W matches anything that is NOT a letter, digit, or underscore
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token: tokens) {
                if (token.length() > 0) {
                    // emit each word as (word, 1); every occurrence counts as 1
                    out.collect(new Tuple2<String,Integer>(token,1));
                }
            }
        }
    }

}
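The Tokenizer plus groupBy(0).sum(1) pipeline boils down to tokenizing on \W+ and summing a count of 1 per word. As an illustrative sketch (plain Java with no Flink API, class name chosen here for illustration), the same logic can be checked locally like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {

    // Mirrors Tokenizer + groupBy(0).sum(1): lowercase the line, split on
    // runs of non-word characters, then add 1 per occurrence of each word.
    public static Map<String, Integer> count(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("Hello Flink, hello WORLD!"));
    }
}
```

The example input yields counts hello=2, flink=1, world=1, matching what the Flink job writes for the same line.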


1.9.2.3. Preparing the data

Create the input directory E:/workspace/wordcount/input and put a words.txt file with some sample text in it. (The original screenshots of the directory layout and of the contents of words.txt are omitted.)

1.9.2.4. Execution results

After running the program, the counts are written to E:/workspace/wordcount/output as one "word count" pair per line. (The original screenshots of the run and of the output file are omitted.)

1.9.3. Writing the Scala word-count program

1.9.3.1. Adding the Flink dependencies
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.toto.learn</groupId>
    <artifactId>flink-learn</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.11.1</version>
        </dependency>

        <!--
        1. compile  : the default scope; needed at runtime and packaged into the jar.
        2. provided : needed at compile time only; the runtime supplies it, so it is not packaged.
        3. runtime  : not needed for compilation, only at runtime; packaged into the jar (separates interface from implementation).
        4. test     : needed only for tests; not packaged.
        5. system   : a jar taken from a path on the local system instead of a repository (rarely used).
        -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>1.11.1</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.12</artifactId>
            <version>1.11.1</version>
        </dependency>

        <!-- dependencies for programming in Scala: start -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.12</artifactId>
            <version>1.11.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.12</artifactId>
            <version>1.11.1</version>
        </dependency>
        <!-- dependencies for programming in Scala: end -->

    </dependencies>

    <build>
        <plugins>
            <!-- compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Scala compiler plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.6</version>
                <configuration>
                    <!-- must match the _2.12 suffix of the Flink artifacts above -->
                    <scalaCompatVersion>2.12</scalaCompatVersion>
                    <scalaVersion>2.12.12</scalaVersion>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- jar-packaging plugin (bundles all dependencies) -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- optionally set the jar's main class -->
                            <mainClass>com.toto.learn.batch.BatchWordCountJava</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

The main additions are:

<!-- dependencies for programming in Scala: start -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.12</artifactId>
    <version>1.11.1</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.12</artifactId>
    <version>1.11.1</version>
</dependency>
<!-- dependencies for programming in Scala: end -->
1.9.3.2. Writing the WordCount Scala program
import org.apache.flink.api.scala.ExecutionEnvironment

object BatchWordCountScala {

  def main(args: Array[String]): Unit = {
    val inputPath = "E:/workspace/wordcount/input"
    val outPut = "E:/workspace/wordcount/output"

    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.readTextFile(inputPath)

    // bring the Scala API implicit conversions into scope
    import org.apache.flink.api.scala._

    val counts = text.flatMap(_.toLowerCase.split("\\W"))
      .filter(_.nonEmpty)
      .map((_,1))
      .groupBy(0)
      .sum(1)

    counts.writeAsCsv(outPut,"\n"," ").setParallelism(1)
    env.execute("batch word count")
  }

}

Run the program; the result is the same as that of the Java version above.

1.9.4. Flink StreamingWindowWordCount

Requirement analysis:
Produce words in real time by hand through a socket; Flink receives the data in real time, aggregates the data within a given time window (for example, 2 seconds), and prints each window's result.
Development:
Add the corresponding Java or Scala dependencies.
Execution:
1: Run nc -l 9000 on the machine the program connects to (the sample code below uses 127.0.0.1, i.e. the local machine)
2: Start the code in IDEA on the local machine

The program:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * @author tuzuoquan
 * @version 1.0
 * @ClassName SocketWindowWordCountJava
 * @description TODO
 * @date 2020/9/11 17:08
 **/
public class SocketWindowWordCountJava {

    public static void main(String[] args) throws Exception {
        // Read the port from the program arguments, defaulting to 9000
        int port;
        try {
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port");
        } catch (Exception e) {
            System.err.println("No port set. use default port 9000--java");
            port = 9000;
        }

        // 1. Obtain the Flink streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String hostname = "127.0.0.1";
        String delimiter = "\n";
        // 2. Obtain the data source: connect to the socket and read its input
        DataStreamSource<String> text = env.socketTextStream(hostname, port, delimiter);

        // 3. Apply the operators: flatten each line into words via flatMap
        DataStream<WordWithCount> windowCounts = text.flatMap(new FlatMapFunction<String, WordWithCount>() {
            @Override
            public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
                String[] splits = value.split("\\s");
                for (String word : splits) {
                    // each occurrence of a word counts as 1
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        }).keyBy("word")    // group by key; Flink looks up the value of the "word" field
                .timeWindow(Time.seconds(2), Time.seconds(1)) // window size 2 seconds, sliding every 1 second
                .sum("count");    // either sum or reduce works here

        // 4. Specify the sink: print the results to the console with parallelism 1
        windowCounts.print().setParallelism(1);
        // 5. Execute the job and give it a name. Without this call the program does not run.
        env.execute("Socket window count");
    }

    public static class WordWithCount {
        public String word;
        public long count;
        public  WordWithCount(){}
        public WordWithCount(String word,long count){
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }
}

The Scala code is as follows:

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * Sliding-window computation:
  *
  * every 1 second, count the data received within the last 2 seconds and print the result to the console.
  *
  * Created by xxx.xxx on 2018/10/8.
  */
object SocketWindowWordCountScala {

  def main(args: Array[String]): Unit = {

    // Read the socket port from the program arguments, defaulting to 9000
    val port: Int = try {
      ParameterTool.fromArgs(args).getInt("port")
    } catch {
      case e: Exception =>
        System.err.println("No port set. use default port 9000--scala")
        9000
    }


    // Obtain the streaming execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // Connect to the socket to read the input data
    val text = env.socketTextStream("127.0.0.1",port,'\n')


    // Parse (flatten) the data, group it, apply the window computation, and aggregate with sum

    // Note: this implicit-conversion import is required; without it the flatMap call below does not compile
    import org.apache.flink.api.scala._

    val windowCounts = text.flatMap(line => line.split("\\s")) // flatten: split each line into words
      .map(w => WordWithCount(w, 1)) // turn each word into (word, 1) form
      .keyBy("word") // group by word
      .timeWindow(Time.seconds(2), Time.seconds(1)) // window size 2 seconds, sliding every 1 second
      .sum("count") // either sum or reduce works
    //.reduce((a,b)=>WordWithCount(a.word,a.count+b.count))

    // Print to the console
    windowCounts.print().setParallelism(1)

    // Execute the job
    env.execute("Socket window count")
  }

  case class WordWithCount(word: String,count: Long)
}
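To make the window semantics above concrete (size 2 seconds, sliding every 1 second, so each event lands in two overlapping windows), here is a plain-Java simulation of how timestamps are assigned to sliding windows. It uses none of the Flink API; the class and method names are illustrative only:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SlidingWindowSketch {

    /**
     * Assigns each non-negative event timestamp (in seconds) to every sliding
     * window it belongs to, for windows of the given size sliding by the given
     * step. A window [start, start + size) contains t when start <= t < start + size.
     * Returns a map from window start to the number of events in that window.
     */
    public static Map<Long, Integer> countPerWindow(List<Long> timestamps,
                                                    long size, long slide) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : timestamps) {
            // The last window containing t starts at the largest multiple of
            // `slide` that is <= t; earlier containing windows start up to
            // (size - slide) before that.
            long lastStart = t - (t % slide);
            for (long start = lastStart; start > t - size; start -= slide) {
                counts.merge(start, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // events at seconds 0, 1, 1, 2 with 2s windows sliding every 1s:
        // e.g. the window [0, 2) sees the events at t=0, t=1, t=1
        System.out.println(countPerWindow(List.of(0L, 1L, 1L, 2L), 2, 1));
    }
}
```

With size equal to slide this degenerates to a tumbling window, which is why each word above is counted once per 2-second window but reported every second.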

1.9.5. Flink program development steps

1: Obtain an execution environment
2: Load/create the initial data
3: Specify the transformation operators that process the data
4: Specify where the computed results should go
5: Call execute() to trigger execution
Note: Flink programs are lazily evaluated; nothing actually runs until execute() is finally called.
The benefit of lazy evaluation: you can write a complex program, and Flink turns it into a single Plan that is executed as one unit.

1.9.6. Differences between Flink Streaming and Batch (at the API level)

Stream processing (Streaming):
StreamExecutionEnvironment
DataStream
Batch processing (Batch):
ExecutionEnvironment
DataSet

1.9.7. Running the program on a cluster

Build:
Add the build configuration to the pom file and specify the fully qualified name of the entry class when packaging (or specify it dynamically at run time).
Mark the Flink dependencies as provided, since the cluster already supplies them.
Run mvn clean package.
Execution:
1: Start a local Flink cluster on the Flink machine
2: On the Flink machine, run nc -l 9000
3: On the Flink machine, run ./bin/flink run -c xxx.xxx.xxx.MainClass FlinkExample-xxxxx.jar --port 9000
4: On the Flink machine, run tail -f log/flink-*-taskexecutor-*.out to watch the output
5: Stop the job, either:
A: from the web UI, or
B: with bin/flink cancel <jobId>

1.9.8. Build configuration needed in the pom file when running the jar on a cluster

<build>
        <plugins>
            <!-- compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Scala compiler plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.1.6</version>
                <configuration>
                    <!-- must match the _2.12 suffix of the Flink artifacts -->
                    <scalaCompatVersion>2.12</scalaCompatVersion>
                    <scalaVersion>2.12.12</scalaVersion>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- jar-packaging plugin (bundles all dependencies) -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- optionally set the jar's main class -->
                            <mainClass>xxx.xxx.SocketWindowWordCountJava</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>