1. Introduction

Using a streaming source such as Kafka or an MQ as a custom real-time data source in Flink is straightforward, and batch reads from MySQL or Redis are simple as well.

But if you need to load data on a schedule and treat it as a streaming source, for example periodically fetching a batch of records from MySQL or Redis and feeding them into Flink for processing, a simple implementation looks like the following.

2. The pom.xml File

Note that starting with 1.15.0, many Flink artifacts no longer require a Scala version suffix; the Scala dependency is bundled internally.

The pom file below covers two Flink versions: 1.16.0 and 1.12.7 (Scala 2.12).

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ye</groupId>
    <artifactId>flink-study</artifactId>
    <version>0.1</version>
    <packaging>jar</packaging>

    <name>Flink Quickstart Job</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.16.0</flink.version>
        <!--<flink.version>1.12.7</flink.version>-->
        <target.java.version>1.8</target.java.version>
        <scala.binary.version>2.12</scala.binary.version>
        <maven.compiler.source>${target.java.version}</maven.compiler.source>
        <maven.compiler.target>${target.java.version}</maven.compiler.target>
        <log4j.version>2.17.1</log4j.version>
    </properties>

    <repositories>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <!-- Apache Flink dependencies -->
        <!-- These dependencies are provided, because they should not be packaged into the JAR file. -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
<!-- For Flink 1.12.7 (Scala 2.12), use the Scala-suffixed artifacts instead:
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-files</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- Add connector dependencies here. They must be in the default scope (compile). -->

        <!-- Example:

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka</artifactId>
            <version>${flink.version}</version>
        </dependency>
        -->

        <!-- Add logging framework, to produce console output when running in the IDE. -->
        <!-- These dependencies are excluded from the application JAR by default. -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>${log4j.version}</version>
            <scope>runtime</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>

            <!-- Java Compiler -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${target.java.version}</source>
                    <target>${target.java.version}</target>
                </configuration>
            </plugin>

            <!-- We use the maven-shade plugin to create a fat jar that contains all necessary dependencies. -->
            <!-- Change the value of <mainClass>...</mainClass> if your program entry point changes. -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <!-- Run shade goal on package phase -->
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <artifactSet>
                                <excludes>
                                    <exclude>org.apache.flink:flink-shaded-force-shading</exclude>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>org.apache.logging.log4j:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>com.ye.DataStreamJob</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <pluginManagement>
            <plugins>

                <!-- This improves the out-of-the-box experience in Eclipse by resolving some warnings. -->
                <plugin>
                    <groupId>org.eclipse.m2e</groupId>
                    <artifactId>lifecycle-mapping</artifactId>
                    <version>1.0.0</version>
                    <configuration>
                        <lifecycleMappingMetadata>
                            <pluginExecutions>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-shade-plugin</artifactId>
                                        <versionRange>[3.1.1,)</versionRange>
                                        <goals>
                                            <goal>shade</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action>
                                        <ignore/>
                                    </action>
                                </pluginExecution>
                                <pluginExecution>
                                    <pluginExecutionFilter>
                                        <groupId>org.apache.maven.plugins</groupId>
                                        <artifactId>maven-compiler-plugin</artifactId>
                                        <versionRange>[3.1,)</versionRange>
                                        <goals>
                                            <goal>testCompile</goal>
                                            <goal>compile</goal>
                                        </goals>
                                    </pluginExecutionFilter>
                                    <action>
                                        <ignore/>
                                    </action>
                                </pluginExecution>
                            </pluginExecutions>
                        </lifecycleMappingMetadata>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>

3. Custom Data Source

The custom source below uses a Timer task (a scheduled thread pool from Executors would work just as well; see the sketch after this class) to generate a random string every five seconds.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.util.Random;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentLinkedQueue;

public class TimerSinkRich extends RichSourceFunction<String> {

    private ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

    // volatile: cancel() is invoked from a different thread than run()
    private volatile boolean flag = true;
    private Timer timer;
    private TimerTask timerTask;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        timerTask = new TimerTask() {
            @Override
            public void run() {
                // Here you could open a MySQL/Redis connection and query a batch of data instead
                Random random = new Random();
                StringBuilder str = new StringBuilder();
                for (int i = 0; i < 10; i++) {
                    char ranLowLetter = (char) (random.nextInt(26) + 97);
                    str.append(ranLowLetter);
                }
                queue.add(str.toString());
            }
        };
        timer = new Timer();
        // The initial delay and period could be passed in via the constructor
        timer.schedule(timerTask, 1000, 5000);
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (flag) {
            if (queue.size() > 0) {
                ctx.collect(queue.remove());
            }
        }
    }

    @Override
    public void cancel() {
        if (null != timer) timer.cancel();
        if (null != timerTask) timerTask.cancel();
        // On job cancellation Flink calls cancel() and, by default, keeps trying for up to 180s at
        // 30s intervals (values may differ across Flink versions) to shut the source down; if the
        // source never stops, the TaskManager cannot release its slot and cancellation fails.
        if (queue.size() <= 0) flag = false;
    }
}
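
As mentioned above, a scheduled thread pool from Executors works just as well as Timer, and the initial delay and period can be passed in through the constructor. Below is a minimal sketch combining both ideas; the class name ScheduledPoolSourceRich and the UUID placeholder payload are mine, not part of the original code.

import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.util.UUID;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledPoolSourceRich extends RichSourceFunction<String> {

    private final long delayMs;
    private final long periodMs;
    private volatile boolean flag = true;
    private ScheduledExecutorService scheduler;
    private ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

    public ScheduledPoolSourceRich(long delayMs, long periodMs) {
        this.delayMs = delayMs;
        this.periodMs = periodMs;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                // Placeholder payload; a real job would query MySQL/Redis here
                () -> queue.add(UUID.randomUUID().toString()),
                delayMs, periodMs, TimeUnit.MILLISECONDS);
        while (flag) {
            Thread.sleep(100);  // avoid a busy spin (see the 2023-07-21 update below)
            String data;
            while (null != (data = queue.poll())) {
                ctx.collect(data);
            }
        }
    }

    @Override
    public void cancel() {
        if (scheduler != null) scheduler.shutdownNow();
        flag = false;
    }
}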

Update, 2023-07-21

The code above has a serious problem: the busy while loop spins continuously and drives CPU usage very high. For how to troubleshoot high CPU usage, see my other article.

My current fix is to sleep inside the loop, which avoids the empty spin and cuts CPU usage dramatically. Replace the run method above with the version below; the Thread.sleep(100) interval can be tuned to match the density of your data.

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (flag) {
            Thread.sleep(100);
            String data;
            // Drain everything currently queued, then sleep again
            while (null != (data = queue.poll())) {
                ctx.collect(data);
            }
        }
    }

Recommendation: whether the cancel method above should keep the if (queue.size() <= 0) guard is a judgment call (a compromise is sketched after this list):

  • With the guard, if the queue is not fully drained within the default window, the job may fail to shut down; you can change the default cancel timeout in the Flink configuration.
  • Without the guard, the task shuts down immediately and whatever is still in the queue is discarded unprocessed.
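
One possible compromise, sketched below (this variant is my suggestion, not from the original): always clear the flag in cancel(), and have run() perform one final drain of the queue before returning, so queued data is not lost while the task still stops quickly.

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (flag) {
            Thread.sleep(100);
            String data;
            while (null != (data = queue.poll())) {
                ctx.collect(data);
            }
        }
        // flag was cleared by cancel(): drain what is left once, then exit
        String rest;
        while (null != (rest = queue.poll())) {
            ctx.collect(rest);
        }
    }

    @Override
    public void cancel() {
        if (null != timer) timer.cancel();
        if (null != timerTask) timerTask.cancel();
        flag = false;  // always stop; run() drains the remainder
    }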

4. Loading the Data Source and Starting the Job

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimerSinkStreamJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Parallelism 1: each parallel source instance would run its own Timer and emit duplicate batches
        executionEnvironment.setParallelism(1);
        DataStreamSource<String> streamSource = executionEnvironment.addSource(new TimerSinkRich());
        streamSource.print();
        executionEnvironment.execute("TimerSinkStreamJob: print data from the timed source");
    }
}

The local test succeeds.

5. Deploying to the Flink Cluster

1. Flink 1.16.0

The job starts successfully.

Cancelling the job succeeds as well.

The slot is released successfully too.

2. Flink 1.12.7

The job starts successfully.

Cancellation works without issue here too, and the slot is likewise released normally.

Of course, you can also do without the open() method and move the timer setup into run():

import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Random;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DiySinkRich extends RichSourceFunction<String> {

    private TimerTask timerTask;
    private Timer timer;
    private volatile boolean flag = true;
    private ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

    @Override
    public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
        timerTask = new TimerTask() {
            @Override
            public void run() {
                // Generate a random 10-character lowercase string each period
                Random random = new Random();
                StringBuilder str = new StringBuilder();
                for (int i = 0; i < 10; i++) {
                    char ranLowLetter = (char) (random.nextInt(26) + 97);
                    str.append(ranLowLetter);
                }
                queue.add(str.toString());
            }
        };
        timer = new Timer();
        timer.schedule(timerTask, 1000, 5000);
        while (flag) {
            Thread.sleep(100);  // same busy-spin fix as in the update above
            String data;
            while (null != (data = queue.poll())) {
                ctx.collect(data);
            }
        }
    }

    @Override
    public void cancel() {
        if (timer != null) timer.cancel();
        if (timerTask != null) timerTask.cancel();
        if (queue.size() == 0) flag = false;
    }
}
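
In a real job, the TimerTask body would query MySQL or Redis instead of generating random strings. Here is a minimal sketch of such a periodic fetch with plain JDBC (it needs the java.sql imports and a MySQL driver on the classpath); the URL, credentials, table, and column names are hypothetical placeholders.

// Hypothetical fetch method called from inside the TimerTask; adjust URL/credentials/SQL to your setup.
private void fetchBatchIntoQueue() {
    String url = "jdbc:mysql://localhost:3306/demo";          // placeholder
    String sql = "SELECT msg FROM demo_table LIMIT 100";      // placeholder
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         PreparedStatement ps = conn.prepareStatement(sql);
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            queue.add(rs.getString("msg"));
        }
    } catch (SQLException e) {
        // In production, log this and decide whether to retry on the next tick
        e.printStackTrace();
    }
}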

That concludes this simple example of periodically loading data as a Flink streaming source.