1. Straight to the point: the flink-daemon.sh script shows that the TaskManager entry class is org.apache.flink.runtime.taskmanager.TaskManager.
2. In its main method, the essential step is starting the TaskManager:
try {
  SecurityUtils.getInstalledContext.runSecured(new Callable[Unit] {
    override def call(): Unit = {
      // Run the TaskManager. Note classOf[TaskManager]: this is the class of the
      // TaskManager actor, and the actor lifecycle methods are defined in it.
      selectNetworkInterfaceAndRunTaskManager(configuration, resourceId, classOf[TaskManager])
    }
  })
}
3. selectNetworkInterfaceAndRunTaskManager does three main things:
a. Creates the high-availability services
b. Selects the hostname and port range for the TaskManager
c. Starts the TaskManager
def selectNetworkInterfaceAndRunTaskManager(
    configuration: Configuration,
    resourceID: ResourceID,
    taskManagerClass: Class[_ <: TaskManager])
  : Unit = {
  val highAvailabilityServices = HighAvailabilityServicesUtils.createHighAvailabilityServices(
    configuration,
    Executors.directExecutor(),
    AddressResolution.TRY_ADDRESS_RESOLUTION)
  // Select the network interface (hostname) and the port range
  val (taskManagerHostname, actorSystemPortRange) = selectNetworkInterfaceAndPortRange(
    configuration,
    highAvailabilityServices)
  try {
    // Start the TaskManager
    runTaskManager(
      taskManagerHostname,
      resourceID,
      actorSystemPortRange,
      configuration,
      highAvailabilityServices,
      taskManagerClass)
  } finally {
    try {
      highAvailabilityServices.close()
    } catch {
      case t: Throwable => LOG.warn("Could not properly stop the high availability services.", t)
    }
  }
}
4. runTaskManager takes the port range selected above, finds a free port in it for the TaskManager's actor system to communicate on, and then calls the overloaded runTaskManager to start the TaskManager.
def runTaskManager(
    taskManagerHostname: String,
    resourceID: ResourceID,
    actorSystemPortRange: java.util.Iterator[Integer],
    configuration: Configuration,
    highAvailabilityServices: HighAvailabilityServices,
    taskManagerClass: Class[_ <: TaskManager])
  : Unit = {
  // Find a free port by trying to open a server socket on each candidate port
  val result = AkkaUtils.retryOnBindException({
    // Try all ports in the range until successful
    val socket = NetUtils.createSocketFromPorts(
      actorSystemPortRange,
      new NetUtils.SocketFactory {
        override def createSocket(port: Int): ServerSocket = new ServerSocket(
          // Use the correct listening address, bound ports will only be
          // detected later by Akka.
          port, 0, InetAddress.getByName(NetUtils.getWildcardIPAddress))
      })
    val port =
      if (socket == null) {
        throw new BindException(s"Unable to allocate port for TaskManager.")
      } else {
        try {
          socket.getLocalPort()
        } finally {
          socket.close()
        }
      }
    runTaskManager(
      taskManagerHostname,
      resourceID,
      port,
      configuration,
      highAvailabilityServices,
      taskManagerClass)
  }, { !actorSystemPortRange.hasNext }, 5000)
  result match {
    case scala.util.Failure(f) => throw f
    case _ =>
  }
}
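The port probing in step 4 can be illustrated in isolation. The following is a minimal Java sketch, not Flink's implementation: the class PortProbe and the method firstFreePort are hypothetical names, and returning -1 stands in for the BindException thrown above. It walks a candidate iterator, tries to bind a ServerSocket on the wildcard address for each port, and closes the socket again before returning the port number.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.util.Iterator;

// Hypothetical helper, for illustration only.
final class PortProbe {

    /** Returns the first candidate port that can be bound, or -1 if none can. */
    static int firstFreePort(Iterator<Integer> candidates) {
        while (candidates.hasNext()) {
            int port = candidates.next();
            // A null bind address means "any local address" (the wildcard);
            // a backlog of 0 selects the platform default.
            try (ServerSocket socket = new ServerSocket(port, 0, (InetAddress) null)) {
                // Read the port before try-with-resources closes the socket.
                return socket.getLocalPort();
            } catch (IOException e) {
                // The port is taken (or cannot be bound); try the next candidate.
            }
        }
        return -1;
    }
}
```

Because the probe socket is closed before the actor system binds the port, another process can grab the port in between. That window is presumably why the original wraps the whole block in AkkaUtils.retryOnBindException, which retries until the port range is exhausted.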
5. The overloaded runTaskManager does the following:
5.1 Creates the TaskManager actor system
val taskManagerSystem = BootstrapTools.startActorSystem(
  configuration,
  taskManagerHostname,
  actorSystemPort,
  LOG.logger)
5.2 Creates a MetricRegistry and starts its query service
val metricRegistry = new MetricRegistryImpl(
  MetricRegistryConfiguration.fromConfiguration(configuration))
metricRegistry.startQueryService(taskManagerSystem, resourceID)
5.3 Starts the TaskManager components and the TaskManager actor
val taskManager = startTaskManagerComponentsAndActor(
  configuration,
  resourceID,
  taskManagerSystem,
  highAvailabilityServices,
  metricRegistry,
  taskManagerHostname,
  Some(TaskExecutor.TASK_MANAGER_NAME),
  localTaskManagerCommunication = false,
  taskManagerClass)
5.3.1 Once the TaskManager actor has started, its lifecycle method preStart runs. Its main job is to start the retrieval service that locates the leader JobManager. In standalone mode the leader address is fixed, so the service notifies the listener of the leader JobManager address immediately.
leaderRetrievalService.start(this)
// See the start method of StandaloneLeaderRetrievalService:
public void start(LeaderRetrievalListener listener) {
    checkNotNull(listener, "Listener must not be null.");
    synchronized (startStopLock) {
        checkState(!started, "StandaloneLeaderRetrievalService can only be started once.");
        started = true;
        // Notify the listener directly with the leader JobManager address
        listener.notifyLeaderAddress(leaderAddress, leaderId);
    }
}
5.3.2 The TaskManager's notifyLeaderAddress method sends a JobManagerLeaderAddress message to the TaskManager actor itself
override def notifyLeaderAddress(leaderAddress: String, leaderSessionID: UUID): Unit = {
  self ! JobManagerLeaderAddress(leaderAddress, leaderSessionID)
}
5.3.3 In the TaskManager actor's handleMessage method, the JobManagerLeaderAddress case is handled as follows:
1. If the TaskManager already has a leader JobManager stored (that is, it is currently connected to one), it first disconnects from the old leader JobManager
2. It stores the new address and, if the new leader session ID is defined, triggers registration of the TaskManager at the JobManager
case JobManagerLeaderAddress(address, newLeaderSessionID) =>
  handleJobManagerLeaderAddress(address, newLeaderSessionID)
private def handleJobManagerLeaderAddress(
    newJobManagerAkkaURL: String,
    leaderSessionID: UUID)
  : Unit = {
  currentJobManager match {
    case Some(jm) =>
      Option(newJobManagerAkkaURL) match {
        case Some(newJMAkkaURL) =>
          // Disconnect from the old leader JobManager
          handleJobManagerDisconnect(s"JobManager $newJMAkkaURL was elected as leader.")
        case None =>
          handleJobManagerDisconnect(s"Old JobManager lost its leadership.")
      }
    case None =>
  }
  this.jobManagerAkkaURL = Option(newJobManagerAkkaURL)
  this.leaderSessionID = Option(leaderSessionID)
  if (this.leaderSessionID.isDefined) {
    // Trigger the TaskManager registration
    triggerTaskManagerRegistration()
  }
}
5.3.4 triggerTaskManagerRegistration sends a TriggerTaskManagerRegistration message to the TaskManager actor itself
self ! decorateMessage(
  TriggerTaskManagerRegistration(
    jobManagerAkkaURL.get,
    new FiniteDuration(
      config.getInitialRegistrationPause().getSize(),
      config.getInitialRegistrationPause().getUnit()),
    deadline,
    1,
    currentRegistrationRun)
)
5.3.5 Registration logic:
case message: RegistrationMessage => handleRegistrationMessage(message)
5.3.5.1 If the TaskManager is already registered, it only logs a message
if (isConnected) {
  // this may be the case, if we queue another attempt and
  // in the meantime, the registration is acknowledged
  log.debug(
    "TaskManager was triggered to register at JobManager, but is already registered")
}
5.3.5.2 If registration has not succeeded before the deadline, the TaskManager gives up and terminates itself
else if (deadline.exists(_.isOverdue())) {
  // we failed to register in time. that means we should quit
  log.error("Failed to register at the JobManager within the defined maximum " +
    "connect time. Shutting down ...")
  // terminate ourselves (hasta la vista)
  self ! decorateMessage(PoisonPill)
}
5.3.5.3 Otherwise, it sends a RegisterTaskManager message to the JobManager actor
val jobManager = context.actorSelection(jobManagerURL)
jobManager ! decorateMessage(
  RegisterTaskManager(
    resourceID,
    location,
    resources,
    numberOfSlots)
)
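The three branches of 5.3.5.1 to 5.3.5.3 form a simple decision. As a rough Java sketch (RegistrationDecision, Action, and decide are hypothetical names, and plain millisecond timestamps stand in for Scala's Deadline), the order of the checks looks like this:

```java
import java.util.OptionalLong;

// Hypothetical model of the branch order in the registration handler.
final class RegistrationDecision {

    enum Action { ALREADY_REGISTERED, GIVE_UP, SEND_REGISTRATION }

    /**
     * @param connected      whether the TaskManager is already registered
     * @param deadlineMillis optional absolute deadline for registration
     * @param nowMillis      current time on the same clock as the deadline
     */
    static Action decide(boolean connected, OptionalLong deadlineMillis, long nowMillis) {
        if (connected) {
            // An earlier attempt was acknowledged in the meantime: only log.
            return Action.ALREADY_REGISTERED;
        }
        if (deadlineMillis.isPresent() && nowMillis > deadlineMillis.getAsLong()) {
            // The deadline has passed: the actor would send itself a PoisonPill.
            return Action.GIVE_UP;
        }
        // Still within the deadline (or no deadline set): send RegisterTaskManager.
        return Action.SEND_REGISTRATION;
    }
}
```

The connected check comes first on purpose: a retry that was queued before the acknowledgement arrived must not shut down a TaskManager that has already registered, even if the deadline has passed by then.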
5.3.5.3.1 The JobManager actor receives the TaskManager's registration message (in JobManager.handleMessage). If a ResourceManager is already registered at the JobManager, the JobManager notifies it with NotifyResourceStarted that a TaskManager has started in the given resource container. The notification uses the ask pattern (?), which returns a Future instead of blocking. If the ask fails or times out, the JobManager sends itself a ReconnectResourceManager message to retry.
currentResourceManager match {
  case Some(rm) =>
    val future = (rm ? decorateMessage(new NotifyResourceStarted(msg.resourceId)))(timeout)
    future.onFailure {
      case t: Throwable =>
        t match {
          case _: TimeoutException =>
            log.info("Attempt to register resource at ResourceManager timed out. Retrying")
          case _ =>
            log.warn("Failure while asking ResourceManager for RegisterResource. Retrying", t)
        }
        self ! decorateMessage(
          new ReconnectResourceManager(
            rm,
            currentResourceManagerConnectionId))
    }(context.dispatcher)
  case None =>
    log.info("Task Manager Registration but not connected to ResourceManager")
}
5.3.5.3.2 If the TaskManager is already registered, the JobManager replies to the TaskManager actor with AlreadyRegistered
if (instanceManager.isRegistered(resourceId)) {
  val instanceID = instanceManager.getRegisteredInstance(resourceId).getId
  taskManager ! decorateMessage(
    AlreadyRegistered(
      instanceID,
      blobServer.getPort))
}
5.3.5.3.3 Otherwise, the JobManager registers it and replies to the TaskManager actor with AcknowledgeRegistration
taskManager ! decorateMessage(
  AcknowledgeRegistration(instanceID, blobServer.getPort))
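5.3.5.3.3.1 On receiving the reply, the TaskManager actor mainly does the following:
1. Starts the BLOB cache
2. Watches the JobManager, so that it learns promptly when the JobManager dies
3. Starts the heartbeat between itself and the JobManager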
5.3.5.3.4 The JobManager watches the registered TaskManager actor, so that it learns promptly when the TaskManager dies
context.watch(taskManager)
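The JobManager side of registration (5.3.5.3.2 and 5.3.5.3.3) is idempotent: a repeated RegisterTaskManager from the same resource ID gets the existing instance ID back rather than a new one. Here is a small Java sketch of that idea, assuming a plain in-memory map. InstanceRegistry, Reply, and ReplyType are hypothetical names, and Flink's InstanceManager tracks considerably more state than this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical, simplified stand-in for the JobManager's instance bookkeeping.
final class InstanceRegistry {

    enum ReplyType { ACKNOWLEDGE_REGISTRATION, ALREADY_REGISTERED }

    static final class Reply {
        final ReplyType type;
        final UUID instanceId;

        Reply(ReplyType type, UUID instanceId) {
            this.type = type;
            this.instanceId = instanceId;
        }
    }

    private final Map<String, UUID> instancesByResourceId = new HashMap<>();

    /** Registers a TaskManager by resource ID, or reports the existing registration. */
    Reply register(String resourceId) {
        UUID existing = instancesByResourceId.get(resourceId);
        if (existing != null) {
            // Duplicate request, for example a retry that crossed the first reply.
            return new Reply(ReplyType.ALREADY_REGISTERED, existing);
        }
        UUID instanceId = UUID.randomUUID();
        instancesByResourceId.put(resourceId, instanceId);
        return new Reply(ReplyType.ACKNOWLEDGE_REGISTRATION, instanceId);
    }
}
```

This matters because the TaskManager keeps resending the registration until it sees a reply, so the JobManager has to tolerate duplicates.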
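5.3.5.4 Back in the TaskManager, a one-off retry is scheduled to fire after the current timeout, in case the registration did not go through (for example, because of network problems). Each retry doubles the timeout, capped at the maximum registration pause. The cycle repeats, much like a recursion, until registration succeeds or the registration deadline passes.
val nextTimeout = (timeout * 2).min(new FiniteDuration(
  config.getMaxRegistrationPause().toMilliseconds,
  TimeUnit.MILLISECONDS))
// schedule a check to trigger a new registration attempt if not registered
// by the timeout
scheduledTaskManagerRegistration = Option(context.system.scheduler.scheduleOnce(
  timeout,
  self,
  decorateMessage(TriggerTaskManagerRegistration(
    jobManagerURL,
    nextTimeout,
    deadline,
    attempt + 1,
    registrationRun)
))(context.dispatcher))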
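To see how the pause between attempts grows, the doubling-with-cap rule can be worked through on its own. This is a minimal Java sketch with hypothetical names (RegistrationBackoff, nextPauseMillis, schedule). It models only the arithmetic of nextTimeout, not the Akka scheduling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reproduces only the timeout arithmetic.
final class RegistrationBackoff {

    /** Doubles the current pause, capped at the maximum pause. */
    static long nextPauseMillis(long currentMillis, long maxMillis) {
        return Math.min(currentMillis * 2, maxMillis);
    }

    /** Lists the pause used before each of the first {@code attempts} retries. */
    static List<Long> schedule(long initialMillis, long maxMillis, int attempts) {
        List<Long> pauses = new ArrayList<>();
        long pause = initialMillis;
        for (int i = 0; i < attempts; i++) {
            pauses.add(pause);
            pause = nextPauseMillis(pause, maxMillis);
        }
        return pauses;
    }
}
```

With an initial pause of 500 ms and a cap of 30 s (illustrative values, not taken from this walkthrough), schedule(500, 30000, 8) yields 500, 1000, 2000, 4000, 8000, 16000, 30000, 30000. The pause stops growing once it reaches the cap, so a TaskManager that starts before its JobManager retries quickly at first and then settles at a steady interval.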
5.4 Starts a ProcessReaper actor that watches the TaskManager actor and kills the JVM process if the TaskManager actor dies
taskManagerSystem.actorOf(
  Props(classOf[ProcessReaper], taskManager, LOG.logger, RUNTIME_FAILURE_RETURN_CODE),
  "TaskManager_Process_Reaper")
The sequence diagram is attached: