start-cluster.sh
我们先来看看start-cluster.sh这个脚本
```bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
# 先调用config.sh读取配置文件
. "$bin"/config.sh
# Start the JobManager instance(s)
# 启动JobManager,分为HA模式和单机模式
shopt -s nocasematch
# 这里的HIGH_AVAILABILITY就是flink.yaml中配置的high-availability
if [[ $HIGH_AVAILABILITY == "zookeeper" ]]; then
# HA Mode
readMasters
echo "Starting HA cluster with ${#MASTERS[@]} masters."
for ((i=0;i<${#MASTERS[@]};++i)); do
master=${MASTERS[i]}
webuiport=${WEBUIPORTS[i]}
if [ ${MASTERS_ALL_LOCALHOST} = true ] ; then
"${FLINK_BIN_DIR}"/jobmanager.sh start "${master}" "${webuiport}"
else
ssh -n $FLINK_SSH_OPTS $master -- "nohup /bin/bash -l \"${FLINK_BIN_DIR}/jobmanager.sh\" start ${master} ${webuiport} &"
fi
done
else
echo "Starting cluster."
# Start single JobManager on this machine
"$FLINK_BIN_DIR"/jobmanager.sh start
fi
shopt -u nocasematch
# Start TaskManager instance(s)
# 启动JobManager之后启动TaskManager,这个方法在config.sh中
TMWorkers start
这个脚本相对较简单,总体流程就是先调用config.sh加载配置文件,
然后判断我们配置的是HA模式还是单机模式,分别执行不同的逻辑,
对应调用jobmanager.sh传不同的参数,启动jobmanager,
最后启动TashManager。
**config.sh**
这个脚本的作用就是加载各种配置信息如JAVA配置,FLINK的配置,HADOOP的配置等等。其中有一点要注意的是TaskManager的启动方法也在这个脚本中
....
# read JAVA_HOME from config with no default value
MY_JAVA_HOME=$(readFromConfig ${KEY_ENV_JAVA_HOME} "" "${YAML_CONF}")
# check if config specified JAVA_HOME
if [ -z "${MY_JAVA_HOME}" ]; then
# config did not specify JAVA_HOME. Use system JAVA_HOME
MY_JAVA_HOME=${JAVA_HOME}
fi
# check if we have a valid JAVA_HOME and if java is not available
if [ -z "${MY_JAVA_HOME}" ] && ! type java > /dev/null 2> /dev/null; then
echo "Please specify JAVA_HOME. Either in Flink config ./conf/flink-conf.yaml or as system-wide JAVA_HOME."
exit 1
else
JAVA_HOME=${MY_JAVA_HOME}
fi
UNAME=$(uname -s)
if [ "${UNAME:0:6}" == "CYGWIN" ]; then
JAVA_RUN=java
else
if [[ -d $JAVA_HOME ]]; then
JAVA_RUN=$JAVA_HOME/bin/java
else
JAVA_RUN=java
fi
fi
......
# Verify that NUMA tooling is available
command -v numactl >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
FLINK_TM_COMPUTE_NUMA="false"
else
# Define FLINK_TM_COMPUTE_NUMA if it is not already set
if [ -z "${FLINK_TM_COMPUTE_NUMA}" ]; then
FLINK_TM_COMPUTE_NUMA=$(readFromConfig ${KEY_TASKM_COMPUTE_NUMA} "false" "${YAML_CONF}")
fi
fi
if [ -z "${MAX_LOG_FILE_NUMBER}" ]; then
MAX_LOG_FILE_NUMBER=$(readFromConfig ${KEY_ENV_LOG_MAX} ${DEFAULT_ENV_LOG_MAX} "${YAML_CONF}")
fi
if [ -z "${FLINK_LOG_DIR}" ]; then
FLINK_LOG_DIR=$(readFromConfig ${KEY_ENV_LOG_DIR} "${DEFAULT_FLINK_LOG_DIR}" "${YAML_CONF}")
fi
if [ -z "${YARN_CONF_DIR}" ]; then
YARN_CONF_DIR=$(readFromConfig ${KEY_ENV_YARN_CONF_DIR} "${DEFAULT_YARN_CONF_DIR}" "${YAML_CONF}")
fi
if [ -z "${HADOOP_CONF_DIR}" ]; then
HADOOP_CONF_DIR=$(readFromConfig ${KEY_ENV_HADOOP_CONF_DIR} "${DEFAULT_HADOOP_CONF_DIR}" "${YAML_CONF}")
fi
if [ -z "${FLINK_PID_DIR}" ]; then
FLINK_PID_DIR=$(readFromConfig ${KEY_ENV_PID_DIR} "${DEFAULT_ENV_PID_DIR}" "${YAML_CONF}")
fi
if [ -z "${FLINK_ENV_JAVA_OPTS}" ]; then
FLINK_ENV_JAVA_OPTS=$(readFromConfig ${KEY_ENV_JAVA_OPTS} "${DEFAULT_ENV_JAVA_OPTS}" "${YAML_CONF}")
# Remove leading and ending double quotes (if present) of value
FLINK_ENV_JAVA_OPTS="$( echo "${FLINK_ENV_JAVA_OPTS}" | sed -e 's/^"//' -e 's/"$//' )"
fi
.....
# Check if deprecated HADOOP_HOME is set, and specify config path to HADOOP_CONF_DIR if it's empty.
if [ -z "$HADOOP_CONF_DIR" ]; then
if [ -n "$HADOOP_HOME" ]; then
# HADOOP_HOME is set. Check if its a Hadoop 1.x or 2.x HADOOP_HOME path
if [ -d "$HADOOP_HOME/conf" ]; then
# It's Hadoop 1.x
HADOOP_CONF_DIR="$HADOOP_HOME/conf"
fi
if [ -d "$HADOOP_HOME/etc/hadoop" ]; then
# It's Hadoop 2.2+
HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
fi
fi
fi
......
# starts or stops TMs on all workers
# TMWorkers start|stop
TMWorkers() {
CMD=$1
readWorkers
# 执行taskmanager.sh
if [ ${WORKERS_ALL_LOCALHOST} = true ] ; then
# all-local setup
for worker in ${WORKERS[@]}; do
"${FLINK_BIN_DIR}"/taskmanager.sh "${CMD}"
done
else
# non-local setup
# start/stop TaskManager instance(s) using pdsh (Parallel Distributed Shell) when available
command -v pdsh >/dev/null 2>&1
if [[ $? -ne 0 ]]; then
for worker in ${WORKERS[@]}; do
ssh -n $FLINK_SSH_OPTS $worker -- "nohup /bin/bash -l \"${FLINK_BIN_DIR}/taskmanager.sh\" \"${CMD}\" &"
done
else
PDSH_SSH_ARGS="" PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS pdsh -w $(IFS=, ; echo "${WORKERS[*]}") \
"nohup /bin/bash -l \"${FLINK_BIN_DIR}/taskmanager.sh\" \"${CMD}\""
fi
fi
}
可以看到config.sh这个脚本加载了许多配置,
这些配置在启动JM或TM的时候都需要用到。
加载完配置信息之后下面我们看看如何启动JM
**
## jobManager.sh
**
在start-cluster.sh脚本中继续往下走,加载完配置信息后,
开始启动JM,JM这里判断是否是HA模式,判断的方式很简单,
就是根据我们在配置文件中high-availability是否设置为zookeeper,如果是HA模式,
则读取masters文件,看看都需要启动哪些节点,
如果是本机启动则直接调用jobmanager.sh脚本,
如果不是本机则使用ssh远程执行脚本启动。
下面我们看看Jobmanager.sh中是如何实现的。
# Start/stop a Flink JobManager.
USAGE="Usage: jobmanager.sh ((start|start-foreground) [host] [webui-port])|stop|stop-all"
STARTSTOP=$1
HOST=$2 # optional when starting multiple instances
WEBUIPORT=$3 # optional when starting multiple instances
if [[ $STARTSTOP != "start" ]] && [[ $STARTSTOP != "start-foreground" ]] && [[ $STARTSTOP != "stop" ]] && [[ $STARTSTOP != "stop-all" ]]; then
echo $USAGE
exit 1
fi
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
. "$bin"/config.sh
# 使用jobmanager.sh启动的,默认就是standalonesession模式
ENTRYPOINT=standalonesession
if [[ $STARTSTOP == "start" ]] || [[ $STARTSTOP == "start-foreground" ]]; then
# Add JobManager-specific JVM options
export FLINK_ENV_JAVA_OPTS="${FLINK_ENV_JAVA_OPTS} ${FLINK_ENV_JAVA_OPTS_JM}"
parseJmJvmArgsAndExportLogs "${ARGS[@]}"
args=("--configDir" "${FLINK_CONF_DIR}" "--executionMode" "cluster")
if [ ! -z $HOST ]; then
args+=("--host")
args+=("${HOST}")
fi
if [ ! -z $WEBUIPORT ]; then
args+=("--webui-port")
args+=("${WEBUIPORT}")
fi
fi
if [[ $STARTSTOP == "start-foreground" ]]; then
exec "${FLINK_BIN_DIR}"/flink-console.sh $ENTRYPOINT "${args[@]}"
else
"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP $ENTRYPOINT "${args[@]}"
fi
可以看到Jobmanager.sh这个脚本也是相对较简单,
在开头的地方同样是执行了config.sh脚本加载配置信息,
然后将一些配置信息组成args变量和ENTRYPOINT作为参数传给flink-daemon.sh脚本,
这里要注意的是ENTRYPOINT的值为standalonesession,
在flink-daemon.sh将作为判断依据。
## flink-daemon.sh
从脚本名就可以看出这个脚本不简单,守护进程的启动脚本,
可以看出这是一个总入口,根据传进来的参数启动对应的java进程,
看看它是怎么完成的。
. "$bin"/config.sh
case $DAEMON in
# 启动TaskManager对应到这
(taskexecutor)
CLASS_TO_RUN=org.apache.flink.runtime.taskexecutor.TaskManagerRunner
;;
(zookeeper)
CLASS_TO_RUN=org.apache.flink.runtime.zookeeper.FlinkZooKeeperQuorumPeer
;;
(historyserver)
CLASS_TO_RUN=org.apache.flink.runtime.webmonitor.history.HistoryServer
;;
# jobmanager.sh中的DAEMON对应到这。
(standalonesession)
CLASS_TO_RUN=org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint
;;
(standalonejob)
CLASS_TO_RUN=org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint
;;
(*)
echo "Unknown daemon '${DAEMON}'. $USAGE."
exit 1
;;
esac
开头老规矩,同样是先加载配置信息,
然后根据我们前面的传参匹配启动的java类。
我们前面传入的是standalonesession
因此匹配到了org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint。
接下来就是准备一些日志,java参数
FLINK_LOG_PREFIX="${FLINK_LOG_DIR}/flink-${FLINK_IDENT_STRING}-${DAEMON}-${id}-${HOSTNAME}"
log="${FLINK_LOG_PREFIX}.log"
out="${FLINK_LOG_PREFIX}.out"
log_setting=("-Dlog.file=${log}" "-Dlog4j.configuration=file:${FLINK_CONF_DIR}/log4j.properties" "-Dlog4j.configurationFile=file:${FLINK_CONF_DIR}/log4j.properties" "-Dlogback.configurationFile=file:${FLINK_CONF_DIR}/logback.xml")
JAVA_VERSION=$(${JAVA_RUN} -version 2>&1 | sed 's/.*version "\(.*\)\.\(.*\)\..*"/\1\2/; 1q')
# Only set JVM 8 arguments if we have correctly extracted the version
if [[ ${JAVA_VERSION} =~ ${IS_NUMBER} ]]; then
if [ "$JAVA_VERSION" -lt 18 ]; then
JVM_ARGS="$JVM_ARGS -XX:MaxPermSize=256m"
fi
fi
然后判断我们传进来的参数是start,stop,还是stop-all,
如果是start,则执行前面匹配到的启动类,并将前面所准备的参数传入。
(start)
# Rotate log files
rotateLogFilesWithPrefix "$FLINK_LOG_DIR" "$FLINK_LOG_PREFIX"
# Print a warning if daemons are already running on host
if [ -f "$pid" ]; then
active=()
while IFS='' read -r p || [[ -n "$p" ]]; do
kill -0 $p >/dev/null 2>&1
if [ $? -eq 0 ]; then
active+=($p)
fi
done < "${pid}"
count="${#active[@]}"
if [ ${count} -gt 0 ]; then
echo "[INFO] $count instance(s) of $DAEMON are already running on $HOSTNAME."
fi
fi
# Evaluate user options for local variable expansion
FLINK_ENV_JAVA_OPTS=$(eval echo ${FLINK_ENV_JAVA_OPTS})
echo "Starting $DAEMON daemon on host $HOSTNAME."
# 执行main方法
$JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" -classpath "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" ${CLASS_TO_RUN} "${ARGS[@]}" > "$out" 200<&- 2>&1 < /dev/null &
mypid=$!
# Add to pid file if successful start
if [[ ${mypid} =~ ${IS_NUMBER} ]] && kill -0 $mypid > /dev/null 2>&1 ; then
echo $mypid >> "$pid"
else
echo "Error starting $DAEMON daemon."
exit 1
fi
;;
JM启动的java类
/**
* Entry point for the standalone session cluster.
*/
public class StandaloneSessionClusterEntrypoint extends SessionClusterEntrypoint {
public static void main(String[] args) {
// startup checks and logging
//打印有关环境的信息
EnvironmentInformation.logEnvironmentInfo(LOG, StandaloneSessionClusterEntrypoint.class.getSimpleName(), args);
//注册一些信号处理
SignalHandler.register(LOG);
//安装安全关闭的钩子
JvmShutdownSafeguard.installAsShutdownHook(LOG);
EntrypointClusterConfiguration entrypointClusterConfiguration = null;
final CommandLineParser<EntrypointClusterConfiguration> commandLineParser = new CommandLineParser<>(new EntrypointClusterConfigurationParserFactory());
try {
//对传入的参数进行解析
//内部通过EntrypointClusterConfigurationParserFactory解析配置文件,返回EntrypointClusterConfiguration为ClusterConfiguration的子类
entrypointClusterConfiguration = commandLineParser.parse(args);
} catch (FlinkParseException e) {
LOG.error("Could not parse command line arguments {}.", args, e);
commandLineParser.printHelp(StandaloneSessionClusterEntrypoint.class.getSimpleName());
System.exit(1);
}
Configuration configuration = loadConfiguration(entrypointClusterConfiguration);
//创建了StandaloneSessionClusterEntrypoint
StandaloneSessionClusterEntrypoint entrypoint = new StandaloneSessionClusterEntrypoint(configuration);
//启动集群的entrypoint。
//这个方法接受的是父类ClusterEntrypoint,可想而知其他几种启动方式也是通过这个方法。
ClusterEntrypoint.runClusterEntrypoint(entrypoint);
}
}
jm和tm的启动今天先看到这里,后面会有专门的文章讲解。
让我们回到start-cluster.sh这个脚本,启动完jobmanager之后就是启动taskmanager,
taskmanager也是同样的启动方式,最后对应到flink-daemon.sh,
启动类为org.apache.flink.runtime.taskexecutor.TaskManagerRunner。java启动类如下:
/**
* This class is the executable entry point for the task manager in yarn or standalone mode.
* It constructs the related components (network, I/O manager, memory manager, RPC service, HA service)
* and starts them.
*/
public class TaskManagerRunner implements FatalErrorHandler, AutoCloseableAsync {
// --------------------------------------------------------------------------------------------
// Static entry point
// --------------------------------------------------------------------------------------------
public static void main(String[] args) throws Exception {
// startup checks and logging
EnvironmentInformation.logEnvironmentInfo(LOG, "TaskManager", args);
SignalHandler.register(LOG);
JvmShutdownSafeguard.installAsShutdownHook(LOG);
long maxOpenFileHandles = EnvironmentInformation.getOpenFileHandlesLimit();
if (maxOpenFileHandles != -1L) {
LOG.info("Maximum number of open file descriptors is {}.", maxOpenFileHandles);
} else {
LOG.info("Cannot determine the maximum number of open file descriptors");
}
runTaskManagerSecurely(args, ResourceID.generate());
}
}
我们用一幅图整体浏览一下Standalone Session启动第一个环节都涉及到了哪些步骤。
