0x06 RM Scheduling: MR Job Submission, Server-Side Analysis

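As mentioned earlier, the protocol that the Client and the RM use to talk to each other in Yarn is ApplicationClientProtocol. We have already analyzed its client-side implementation, ApplicationClientProtocolPBClientImpl; in this chapter we start from its server-side implementation, ClientRMService.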

6.1 Obtaining the JobID

6.1.1 ClientRMService

Let's see how this applicationId is actually generated:

@Override
  public GetNewApplicationResponse getNewApplication(
      GetNewApplicationRequest request) throws YarnException {
    GetNewApplicationResponse response = recordFactory
        .newRecordInstance(GetNewApplicationResponse.class);
    // This is the key line: it calls getNewApplicationId
    response.setApplicationId(getNewApplicationId());
    // As mentioned earlier, GetNewApplicationResponse also carries the cluster's maximum resource capability
    response.setMaximumResourceCapability(scheduler
        .getMaximumResourceCapability());       
    return response;
  }

Now let's look at the getNewApplicationId method:

ApplicationId getNewApplicationId() {
    // This is where the ID is generated: it takes the cluster timestamp and an atomically incremented applicationCounter
    ApplicationId applicationId = org.apache.hadoop.yarn.server.utils.BuilderUtils
        .newApplicationId(recordFactory, ResourceManager.getClusterTimeStamp(),
            applicationCounter.incrementAndGet());
    LOG.info("Allocated new applicationId: " + applicationId.getId());
    return applicationId;
  }

Here is the definition of the application counter applicationCounter, an atomic int:

final private AtomicInteger applicationCounter = new AtomicInteger(0);
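
The ID scheme above can be sketched in a few lines. This is a minimal illustration, not the Hadoop code: AppIdSketch and its method are made-up names, and the text form application_<timestamp>_<4-digit id> is the way such IDs usually show up in logs, which is assumed here rather than taken from the listing above.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the RM-side ID allocation; not the Hadoop classes.
class AppIdSketch {
    // Plays the role of the cluster timestamp, fixed for the lifetime of one RM run.
    private final long clusterTimestamp;
    // incrementAndGet() is atomic, so concurrent callers never receive the same value.
    private final AtomicInteger applicationCounter = new AtomicInteger(0);

    AppIdSketch(long clusterTimestamp) {
        this.clusterTimestamp = clusterTimestamp;
    }

    // Builds the assumed text form "application_<timestamp>_<4-digit id>".
    String newApplicationId() {
        int id = applicationCounter.incrementAndGet();
        return String.format(Locale.ROOT, "application_%d_%04d", clusterTimestamp, id);
    }
}
```

Since the timestamp differs between RM runs, the counter can start from zero again after a restart without colliding with IDs handed out earlier.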

6.1.2 BuilderUtils

The class comment is brief: it helps build a variety of objects. Let's look at the newApplicationId method used above:

public static ApplicationId newApplicationId(RecordFactory recordFactory,
      long clustertimestamp, CharSequence id) {
    return ApplicationId.newInstance(clustertimestamp,
        Integer.parseInt(id.toString()));
  }

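Annoyingly, the recordFactory parameter is never used, so why pass it in at all? Note also that the call site above passes the int returned by incrementAndGet(), so the call presumably resolves to an overload taking an int id rather than this CharSequence variant; either way the work is delegated to ApplicationId.newInstance.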

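6.1.3 ApplicationId

The class comment says: ApplicationId represents the globally unique identifier of an application. Its uniqueness is guaranteed by the cluster timestamp (i.e. the RM start time) and a monotonically increasing application counter. Now let's look at the newInstance method used above: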

public static ApplicationId newInstance(long clusterTimestamp, int id) {
    ApplicationId appId = Records.newRecord(ApplicationId.class);
    appId.setClusterTimestamp(clusterTimestamp);
    appId.setId(id);
    appId.build();
    return appId;
  }

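Below this point the appId is built with google.protobuf. I have not studied protobuf in depth yet, so I will stop here and fill this in once I have had time to look into it.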

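That completes the ApplicationId generation flow. Note that generating an ApplicationId does not yet submit the application to the cluster for execution. Next we turn to the Job submission flow.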

6.2 Job Submission

6.2.1 ClientRMService

@Override
  public SubmitApplicationResponse submitApplication(
      SubmitApplicationRequest request) throws YarnException {
    ApplicationSubmissionContext submissionContext = request
        .getApplicationSubmissionContext();
    ApplicationId applicationId = submissionContext.getApplicationId();

    // ApplicationSubmissionContext needs to be validated for safety - only
    // those fields that are independent of the RM's configuration will be
    // checked here, those that are dependent on RM configuration are validated
    // in RMAppManager.

    String user = null;
    try {
      // Safety
      user = UserGroupInformation.getCurrentUser().getShortUserName();
    } catch (IOException ie) {
      LOG.warn("Unable to get the current user.", ie);
      RMAuditLogger.logFailure(user, AuditConstants.SUBMIT_APP_REQUEST,
          ie.getMessage(), "ClientRMService",
          "Exception in submitting application", applicationId);
      throw RPCUtil.getRemoteException(ie);
    }

    // Check whether app has already been put into rmContext,
    // If it is, simply return the response
    if (rmContext.getRMApps().get(applicationId) != null) {
      LOG.info("This is an earlier submitted application: " + applicationId);
      return SubmitApplicationResponse.newInstance();
    }

    if (submissionContext.getQueue() == null) {
      submissionContext.setQueue(YarnConfiguration.DEFAULT_QUEUE_NAME);
    }
    if (submissionContext.getApplicationName() == null) {
      submissionContext.setApplicationName(
          YarnConfiguration.DEFAULT_APPLICATION_NAME);
    }
    if (submissionContext.getApplicationType() == null) {
      submissionContext
        .setApplicationType(YarnConfiguration.DEFAULT_APPLICATION_TYPE);
    } else {
      if (submissionContext.getApplicationType().length() > YarnConfiguration.APPLICATION_TYPE_LENGTH) {
        submissionContext.setApplicationType(submissionContext
          .getApplicationType().substring(0,
            YarnConfiguration.APPLICATION_TYPE_LENGTH));
      }
    }

    try {
      // Call rmAppManager to start the Application. It was mentioned in the Yarn startup chapter.
      rmAppManager.submitApplication(submissionContext,
          System.currentTimeMillis(), user);

      LOG.info("Application with id " + applicationId.getId() + 
          " submitted by user " + user);
      RMAuditLogger.logSuccess(user, AuditConstants.SUBMIT_APP_REQUEST,
          "ClientRMService", applicationId);
    } catch (YarnException e) {
      LOG.info("Exception in submitting application with id " +
          applicationId.getId(), e);
      RMAuditLogger.logFailure(user, AuditConstants.SUBMIT_APP_REQUEST,
          e.getMessage(), "ClientRMService",
          "Exception in submitting application", applicationId);
      throw e;
    }

    SubmitApplicationResponse response = recordFactory
        .newRecordInstance(SubmitApplicationResponse.class);
    return response;
  }
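
The defaulting and truncation steps in the middle of submitApplication are easy to miss, so here is a small sketch of them in isolation. SubmissionDefaultsSketch is a made-up class, and the three constant values are assumptions standing in for the YarnConfiguration constants, whose real values are not shown in the listing.

```java
// Hypothetical, simplified stand-in for the defaulting step; not the Hadoop code.
class SubmissionDefaultsSketch {
    static final String DEFAULT_QUEUE_NAME = "default";    // assumed value
    static final String DEFAULT_APPLICATION_TYPE = "YARN"; // assumed value
    static final int APPLICATION_TYPE_LENGTH = 20;         // assumed value

    // A missing queue falls back to the default queue.
    static String normalizeQueue(String queue) {
        return queue == null ? DEFAULT_QUEUE_NAME : queue;
    }

    // A missing type falls back to the default; an over-long type is cut to the limit.
    static String normalizeType(String type) {
        if (type == null) {
            return DEFAULT_APPLICATION_TYPE;
        }
        return type.length() > APPLICATION_TYPE_LENGTH
            ? type.substring(0, APPLICATION_TYPE_LENGTH)
            : type;
    }
}
```

Note that an over-long type is silently truncated rather than rejected, which matches the else branch in the listing.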

6.2.2 RMAppManager

This is the RMAppManager mentioned earlier, which manages App submission, completion, recovery and similar operations. Let's look at its submitApplication method:

protected void submitApplication(
      ApplicationSubmissionContext submissionContext, long submitTime,
      String user) throws YarnException {
    // Get the appID from the submitted application context
    ApplicationId applicationId = submissionContext.getApplicationId();
    // Create the RMAppImpl, the class that represents an Application running in the RM
    RMAppImpl application =
        createAndPopulateNewRMApp(submissionContext, submitTime, user, false);
    // Read the ApplicationId again (the same value as applicationId above)
    ApplicationId appId = submissionContext.getApplicationId();
    Credentials credentials = null;
    try {
      credentials = parseCredentials(submissionContext);
      if (UserGroupInformation.isSecurityEnabled()) {
        this.rmContext.getDelegationTokenRenewer().addApplicationAsync(appId,
            credentials, submissionContext.getCancelTokensWhenComplete(),
            application.getUser());
      } else {
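        // Hand an RMAppEventType.START event to the AsyncDispatcher.
        // If the dispatcher has not been started yet at this point, the START event queued here
        // should be guaranteed to be processed first once the dispatcher starts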
        this.rmContext.getDispatcher().getEventHandler()
            .handle(new RMAppEvent(applicationId, RMAppEventType.START));
      }
    } catch (Exception e) {
      LOG.warn("Unable to parse credentials.", e);
      // Sending APP_REJECTED is fine, since we assume that the
      // RMApp is in NEW state and thus we haven't yet informed the
      // scheduler about the existence of the application
      assert application.getState() == RMAppState.NEW;
      this.rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, e.getMessage()));
      throw RPCUtil.getRemoteException(e);
    }
  }

Now look at the createAndPopulateNewRMApp method used above:

private RMAppImpl createAndPopulateNewRMApp(
      ApplicationSubmissionContext submissionContext, long submitTime,
      String user, boolean isRecovery) throws YarnException {
    ApplicationId applicationId = submissionContext.getApplicationId();
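    // Validate the submissionContext and create the resource request.
    // A ResourceRequest is a request sent by an app to the RM for a number of containers;
    // it includes the priority, the desired host or rack name (* means any), the required resources,
    // the number of containers, and the relax-locality flag (default true)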
    ResourceRequest amReq = 
        validateAndCreateResourceRequest(submissionContext, isRecovery);

    // Create the RMApp
    RMAppImpl application =
        new RMAppImpl(applicationId, rmContext, this.conf,
            submissionContext.getApplicationName(), user,
            submissionContext.getQueue(),
            submissionContext, this.scheduler, this.masterService,
            submitTime, submissionContext.getApplicationType(),
            submissionContext.getApplicationTags(), amReq);

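    // Note that the application is put into the container held by the activeServiceContext inside rmContext here;
    // it is a ConcurrentMap keyed by appId with RMApp values.
    // If apps submitted concurrently carry the same applicationId, the later one fails with an exception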
    if (rmContext.getRMApps().putIfAbsent(applicationId, application) !=
        null) {
      String message = "Application with id " + applicationId
          + " is already present! Cannot add a duplicate!";
      LOG.warn(message);
      throw new YarnException(message);
    }
    // Inform the ACLs Manager
    this.applicationACLsManager.addApplication(applicationId,
        submissionContext.getAMContainerSpec().getApplicationACLs());
    String appViewACLs = submissionContext.getAMContainerSpec()
        .getApplicationACLs().get(ApplicationAccessType.VIEW_APP);
    rmContext.getSystemMetricsPublisher().appACLsUpdated(
        application, appViewACLs, System.currentTimeMillis());
    return application;
  }
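
The duplicate check above relies on the atomicity of putIfAbsent. A minimal sketch of the same idea follows; AppRegistrySketch is a made-up class, and plain strings stand in for ApplicationId and RMApp.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry that mimics the duplicate check; not the Hadoop RMContext.
class AppRegistrySketch {
    private final ConcurrentMap<String, String> apps = new ConcurrentHashMap<>();

    // putIfAbsent returns null only for the caller that actually inserted the entry,
    // so exactly one of several concurrent submissions with the same ID succeeds.
    void register(String appId, String app) {
        if (apps.putIfAbsent(appId, app) != null) {
            throw new IllegalStateException(
                "Application with id " + appId + " is already present! Cannot add a duplicate!");
        }
    }

    String get(String appId) {
        return apps.get(appId);
    }
}
```

A separate containsKey check followed by put would leave a window between the two calls; doing both in one atomic operation closes it.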

6.2.3 AsyncDispatcher-RMAppEventType.START

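As mentioned earlier, this is an asynchronous event dispatcher. GenericEventHandler.handle puts the event on the queue, the thread built by createThread takes it off, and dispatch finally looks up the handler registered for the event type and calls its handle method.
When submitting the application, the event class we passed in is RMAppEvent, whose type is org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType. Let's see how the handler registered for that type,
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher, processes the event: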

6.2.4 ApplicationEventDispatcher

It is an inner class of ResourceManager.

@Override
    public void handle(RMAppEvent event) {
      ApplicationId appID = event.getApplicationId();
      // Get the app from the RM context
      RMApp rmApp = this.rmContext.getRMApps().get(appID);
      if (rmApp != null) {
        try {
          rmApp.handle(event);
        } catch (Throwable t) {
          LOG.error("Error in handling event type " + event.getType()
              + " for application " + appID, t);
        }
      }
    }
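
The dispatch path described in 6.2.3 and 6.2.4 can be sketched as a queue plus a registry keyed by the event-type enum class. MiniDispatcherSketch and everything inside it are made-up names for illustration; it is not the Hadoop AsyncDispatcher, and it drains the queue with poll() on the caller's thread so that the sketch stays single-threaded, whereas a real dispatcher thread would block in a loop.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical mini dispatcher; the names echo the text but this is only a sketch.
class MiniDispatcherSketch {
    enum AppEventType { START, APP_ACCEPTED }

    // An event carries an enum type; the enum's class selects the handler.
    static final class Event {
        final Enum<?> type;
        final String payload;
        Event(Enum<?> type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    private final BlockingQueue<Event> eventQueue = new LinkedBlockingQueue<>();
    private final Map<Class<?>, Consumer<Event>> handlers = new ConcurrentHashMap<>();

    // One handler per event-type enum class.
    void register(Class<? extends Enum<?>> eventTypeClass, Consumer<Event> handler) {
        handlers.put(eventTypeClass, handler);
    }

    // Producer side: only enqueue, never run the handler here.
    void handle(Event event) {
        eventQueue.add(event);
    }

    // Consumer side: take one event, if any, and route it by the class of its type.
    boolean dispatchOne() {
        Event event = eventQueue.poll();
        if (event == null) {
            return false;
        }
        Consumer<Event> handler = handlers.get(event.type.getDeclaringClass());
        if (handler == null) {
            throw new IllegalStateException("No handler registered for " + event.type);
        }
        handler.accept(event);
        return true;
    }
}
```

Keying the registry by the enum class rather than by the individual constant is what lets a single handler such as ApplicationEventDispatcher receive every RMAppEventType value.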

6.2.5 RMAppImpl

@Override
  public void handle(RMAppEvent event) {
    // Acquire the write lock
    this.writeLock.lock();
    try {
      ApplicationId appID = event.getApplicationId();
      LOG.debug("Processing event for " + appID + " of type "
          + event.getType());
      final RMAppState oldState = getState();
      try {
        // Keep the master in sync with the state machine. The event type passed in here is RMAppEventType.START and the event is an RMAppEvent
        this.stateMachine.doTransition(event.getType(), event);
      } catch (InvalidStateTransitonException e) {
        LOG.error("Can't handle this event at current state", e);
        /* TODO fail the application on the failed transition */
      }

      // Log if the state changed
      if (oldState != getState()) {
        LOG.info(appID + " State change from " + oldState + " to "
            + getState());
      }
    } finally {
      // Finally release the write lock
      this.writeLock.unlock();
    }
  }

6.2.6 StateMachineFactory

The name says it all: a state machine factory. Here the incoming event is START and the current state is RMAppState.NEW:

@Override
    public synchronized STATE doTransition(EVENTTYPE eventType, EVENT event)
         throws InvalidStateTransitonException  {
      // The event type we pass in is START.
      // operand is the RMAppImpl, currentState is RMAppState.NEW, eventType is START, event is an RMAppEvent
      currentState = StateMachineFactory.this.doTransition
          (operand, currentState, eventType, event);
      // After StateMachineFactory.this.doTransition, currentState is RMAppState.NEW_SAVING
      return currentState;
    }

A quick note on RMAppState:

public enum RMAppState {
  NEW,
  NEW_SAVING,
  SUBMITTED,
  ACCEPTED,
  RUNNING,
  FINAL_SAVING,
  FINISHING,
  FINISHED,
  FAILED,
  KILLING,
  KILLED
}

Next, look at StateMachineFactory.this.doTransition(operand, currentState, eventType, event):

private STATE doTransition
           (OPERAND operand, STATE oldState, EVENTTYPE eventType, EVENT event)
      throws InvalidStateTransitonException {
    // This step is important: stateMachineTable maps each current state to a map from event type to the Transition registered for it
    Map<EVENTTYPE, Transition<OPERAND, STATE, EVENTTYPE, EVENT>> transitionMap
      = stateMachineTable.get(oldState);
    if (transitionMap != null) {
      Transition<OPERAND, STATE, EVENTTYPE, EVENT> transition
          = transitionMap.get(eventType);
      if (transition != null) {
        // Reaching here means a transition is registered, so perform it
        return transition.doTransition(operand, oldState, event, eventType);
      }
    }
    // Reaching here means no transition is registered for this event type in the current state, so throw
    throw new InvalidStateTransitonException(oldState, eventType);
  }

Next comes transition.doTransition; what gets called here is the inner class SingleInternalArc:

@Override
    public STATE doTransition(OPERAND operand, STATE oldState,
                              EVENT event, EVENTTYPE eventType) {
      if (hook != null) {
        hook.transition(operand, event);
      }
      return postState;
    }
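
The two-level lookup and the single-arc behavior can be put together in a small sketch. StateTableSketch is a made-up class with a cut-down set of states and events; the hook type and the use of a StringBuilder as the operand are assumptions chosen only to keep the example short.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical, stripped-down state machine; not the Hadoop StateMachineFactory.
class StateTableSketch {
    enum State { NEW, NEW_SAVING, SUBMITTED }
    enum EventType { START, APP_NEW_SAVED }

    // One arc: an optional hook to run, plus the fixed state to land in afterwards.
    static final class Arc {
        final State postState;
        final BiConsumer<StringBuilder, EventType> hook;
        Arc(State postState, BiConsumer<StringBuilder, EventType> hook) {
            this.postState = postState;
            this.hook = hook;
        }
    }

    // current state -> (event type -> arc)
    private final Map<State, Map<EventType, Arc>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    void addTransition(State pre, State post, EventType type,
                       BiConsumer<StringBuilder, EventType> hook) {
        table.computeIfAbsent(pre, s -> new EnumMap<EventType, Arc>(EventType.class))
             .put(type, new Arc(post, hook));
    }

    // Look up by current state, then by event type; fail if no arc is registered.
    synchronized State doTransition(StringBuilder operand, EventType type) {
        Map<EventType, Arc> arcs = table.get(current);
        Arc arc = arcs == null ? null : arcs.get(type);
        if (arc == null) {
            throw new IllegalStateException("Invalid event: " + type + " at " + current);
        }
        if (arc.hook != null) {
            arc.hook.accept(operand, type);
        }
        current = arc.postState;
        return current;
    }
}
```

With a single arc the post state is fixed when the table is built; the hook only performs side effects and has no say in where the machine ends up.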

We still need to follow hook.transition.

6.2.7 RMAppImpl$RMAppNewlySavingTransition

This brings us back to RMAppImpl; the hook called is its static inner class RMAppNewlySavingTransition:

private static final class RMAppNewlySavingTransition extends RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {
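      // If recovery is enabled, store the app info with a non-blocking call so that the RM is guaranteed to have saved
      // what it needs to restart the AM; after an RM restart the AM can then be relaunched without further client communication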
      LOG.info("Storing application with id " + app.applicationId);
      app.rmContext.getStateStore().storeNewApplication(app);
    }
  }

6.2.8 RMStateStore

We have mentioned this class before: it manages the state information that the RM persists. Here is the relevant method:

/**
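   * Non-blocking API.
   * RM services use this method to store the state of an application.
   * It does not block the calling (dispatcher) thread.
   * An RMAppStoredEvent is sent on completion to notify the RMApp.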
   */
  @SuppressWarnings("unchecked")
  public void storeNewApplication(RMApp app) {
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    assert context instanceof ApplicationSubmissionContextPBImpl;
    // Build the app state data instance, which holds all the state of an app that needs to be persisted
    ApplicationStateData appState =
        ApplicationStateData.newInstance(
            app.getSubmitTime(), app.getStartTime(), context, app.getUser());
    // The eventHandler here is AsyncDispatcher.GenericEventHandler; it puts the STORE_APP event on the eventQueue
    dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
  }

Let's take a quick look at RMStateStoreAppEvent:

public class RMStateStoreAppEvent extends RMStateStoreEvent {

  private final ApplicationStateData appState;

  public RMStateStoreAppEvent(ApplicationStateData appState) {
    // The RMStateStoreEventType is set to STORE_APP at construction
    super(RMStateStoreEventType.STORE_APP);
    this.appState = appState;
  }

  public ApplicationStateData getAppState() {
    return appState;
  }
}

As you can see, the event type is initially STORE_APP.

6.2.9 AsyncDispatcher-RMStateStoreEventType.STORE_APP

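We said above that the STORE_APP event is put on the eventQueue; the thread created by createThread then consumes and processes it.
Here it is handled by the handle method of the inner class RMStateStore.ForwardingEventHandler: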

// This class hides the handling of store events from the store's public interface
 private final class ForwardingEventHandler implements EventHandler<RMStateStoreEvent> {
    @Override
    public void handle(RMStateStoreEvent event) {
      handleStoreEvent(event);
    }
  }

Here is handleStoreEvent:

protected void handleStoreEvent(RMStateStoreEvent event) {
    try {
      this.stateMachine.doTransition(event.getType(), event);
    } catch (InvalidStateTransitonException e) {
      LOG.error("Can't handle this event at current state", e);
    }
  }

6.2.10 RMStateStore.StoreAppTransition

After a few more calls we arrive at the inner class StoreAppTransition in RMStateStore. Let's look at its transition method:

private static class StoreAppTransition
      implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
    @Override
    public void transition(RMStateStore store, RMStateStoreEvent event) {
      if (!(event instanceof RMStateStoreAppEvent)) {
        // should never happen
        LOG.error("Illegal event type: " + event.getClass());
        return;
      }
      ApplicationStateData appState =
          ((RMStateStoreAppEvent) event).getAppState();
      ApplicationId appId =
          appState.getApplicationSubmissionContext().getApplicationId();
      LOG.info("Storing info for app: " + appId);
      try {
        // This is where the appState is actually persisted
        store.storeApplicationStateInternal(appId, appState);
        // Build an RMAppEventType.APP_NEW_SAVED event.
        // notifyApplication tells the application that the new app has been persisted (stored or updated);
        // it sends the APP_NEW_SAVED RMAppEvent to the handle method of AsyncDispatcher.GenericEventHandler
        store.notifyApplication(new RMAppEvent(appId,
               RMAppEventType.APP_NEW_SAVED));
      } catch (Exception e) {
        LOG.error("Error storing app: " + appId, e);
        store.notifyStoreOperationFailed(e);
      }
    };
  }

6.2.11 ZKRMStateStore

Our production environment is configured with ZKRMStateStore, so let's look at its storeApplicationStateInternal method:

@Override
  public synchronized void storeApplicationStateInternal(ApplicationId appId,
      ApplicationStateData appStateDataPB) throws Exception {
    // Build the target zk path
    String nodeCreatePath = getNodePath(rmAppRoot, appId.toString());

    if (LOG.isDebugEnabled()) {
      LOG.debug("Storing info for app: " + appId + " at: " + nodeCreatePath);
    }
    // Store the data in zk, with retries
    byte[] appStateData = appStateDataPB.getProto().toByteArray();
    createWithRetries(nodeCreatePath, appStateData, zkAcl,
      CreateMode.PERSISTENT);
  }
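
The body of createWithRetries is not shown here, so the following is only a generic sketch of what "store with retries" tends to look like. RetrySketch and withRetries are made-up names, and the sketch leaves out the pause between attempts that a real ZooKeeper client would normally add.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper; the real createWithRetries is not reproduced in the text.
class RetrySketch {
    // Runs the action up to maxAttempts times and rethrows the last failure, wrapped.
    static <T> T withRetries(int maxAttempts, Callable<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Remember the failure and try again until the attempts are used up.
                last = new RuntimeException("Attempt " + attempt + " failed", e);
            }
        }
        throw last;
    }
}
```

A call might look like RetrySketch.withRetries(3, () -> store.create(path, data)), where store.create is again a placeholder for whatever write is being protected.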

We will not dig any deeper here; back to job submission.

6.2.12 RMAppImpl.AddApplicationToSchedulerTransition

The call store.notifyApplication(new RMAppEvent(appId,RMAppEventType.APP_NEW_SAVED)) from 6.2.10 eventually, after several layers of calls, reaches the transition method of the inner class RMAppImpl.AddApplicationToSchedulerTransition:

private static final class AddApplicationToSchedulerTransition extends
      RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {
        // Submit an AppAddedSchedulerEvent wrapping the App info to the GenericEventHandler;
        // the event type is SchedulerEventType.APP_ADDED
        app.handler.handle(new AppAddedSchedulerEvent(app.applicationId,
        app.submissionContext.getQueue(), app.user,
        app.submissionContext.getReservationID()));
    }
  }

6.2.13 ResourceManager.SchedulerEventDispatcher

After passing through GenericEventHandler, this APP_ADDED event is finally handed to SchedulerEventDispatcher, an inner class of ResourceManager. A brief analysis:

// It extends AbstractService, which we know well by now,
 // and implements EventHandler, so it is also an event handler with a handle method
 public static class SchedulerEventDispatcher extends AbstractService
      implements EventHandler<SchedulerEvent> {
    // The scheduler object
    private final ResourceScheduler scheduler;
    // It also has its own blocking event queue
    private final BlockingQueue<SchedulerEvent> eventQueue =
      new LinkedBlockingQueue<SchedulerEvent>();
    private volatile int lastEventQueueSizeLogged = 0;
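    // The thread that processes events
    private final Thread eventProcessor;
    // Flag telling the thread whether it should stop
    private volatile boolean stopped = false;
    // Whether an exception during event handling should make the process exit
    private boolean shouldExitOnError = false;
    // Constructor; as covered earlier, the RM calls it in its serviceInit method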
    public SchedulerEventDispatcher(ResourceScheduler scheduler) {
      super(SchedulerEventDispatcher.class.getName());
      this.scheduler = scheduler;
      this.eventProcessor = new Thread(new EventProcessor());
      this.eventProcessor.setName("ResourceManager Event Processor");
    }

    @Override
    protected void serviceInit(Configuration conf) throws Exception {
      this.shouldExitOnError =
          conf.getBoolean(Dispatcher.DISPATCHER_EXIT_ON_ERROR_KEY,
            Dispatcher.DEFAULT_DISPATCHER_EXIT_ON_ERROR);
      super.serviceInit(conf);
    }

    @Override
    protected void serviceStart() throws Exception {
      this.eventProcessor.start();
      super.serviceStart();
    }

    // The event-processing thread body, similar to the createThread thread seen earlier
    private final class EventProcessor implements Runnable {
      @Override
      public void run() {

        SchedulerEvent event;

        while (!stopped && !Thread.currentThread().isInterrupted()) {
          try {
            event = eventQueue.take();
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }

          try {
            // Note: the event is handed directly to our protagonist, the scheduler!
            scheduler.handle(event);
          } catch (Throwable t) {
            // An error occurred, but we are shutting down anyway.
            // If it was an InterruptedException, the very act of 
            // shutdown could have caused it and is probably harmless.
            if (stopped) {
              LOG.warn("Exception during shutdown: ", t);
              break;
            }
            LOG.fatal("Error in handling event type " + event.getType()
                + " to the scheduler", t);
            if (shouldExitOnError
                && !ShutdownHookManager.get().isShutdownInProgress()) {
              LOG.info("Exiting, bbye..");
              System.exit(-1);
            }
          }
        }
      }
    }

    @Override
    protected void serviceStop() throws Exception {
      this.stopped = true;
      this.eventProcessor.interrupt();
      try {
        this.eventProcessor.join();
      } catch (InterruptedException e) {
        throw new YarnRuntimeException(e);
      }
      super.serviceStop();
    }

    @Override
    public void handle(SchedulerEvent event) {
      try {
        int qSize = eventQueue.size();
        if (qSize != 0 && qSize % 1000 == 0
            && lastEventQueueSizeLogged != qSize) {
          lastEventQueueSizeLogged = qSize;
          LOG.info("Size of scheduler event-queue is " + qSize);
        }
        int remCapacity = eventQueue.remainingCapacity();
        if (remCapacity < 1000) {
          LOG.info("Very low remaining capacity on scheduler event queue: "
              + remCapacity);
        }
        // Handling an event just means putting it on our own blocking queue for the processor thread
        this.eventQueue.put(event);
      } catch (InterruptedException e) {
        LOG.info("Interrupted. Trying to exit gracefully.");
      }
    }
  }
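
Stripped of logging and error handling, the class above is a queue, one consumer thread and a stop flag. The sketch below shows that skeleton; QueueThreadSketch is a made-up class, strings stand in for SchedulerEvent, and a Consumer stands in for the scheduler.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical queue-plus-thread dispatcher; not the Hadoop SchedulerEventDispatcher.
class QueueThreadSketch {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private final Thread eventProcessor;
    private volatile boolean stopped = false;

    QueueThreadSketch(Consumer<String> scheduler) {
        eventProcessor = new Thread(() -> {
            while (!stopped && !Thread.currentThread().isInterrupted()) {
                try {
                    // Blocks until a producer has put something on the queue.
                    scheduler.accept(eventQueue.take());
                } catch (InterruptedException e) {
                    return; // stop() interrupts a blocked take()
                }
            }
        }, "sketch-event-processor");
    }

    void start() {
        eventProcessor.start();
    }

    // Producer side: enqueue and return immediately.
    void handle(String event) {
        eventQueue.add(event);
    }

    // Set the flag first, then interrupt, then wait for the thread to exit.
    void stop() {
        stopped = true;
        eventProcessor.interrupt();
        try {
            eventProcessor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Giving the scheduler its own queue and thread means a slow scheduler.handle call delays only scheduler events, not everything else that flows through the central dispatcher.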

6.2.14 FairScheduler

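After a long and winding road, our protagonist FairScheduler finally takes the stage.
Let's first look at the handle method triggered by the code above. Because the incoming event is APP_ADDED, the following branch is taken: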

case APP_ADDED:
      if (!(event instanceof AppAddedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAddedSchedulerEvent appAddedEvent = (AppAddedSchedulerEvent) event;
      addApplication(appAddedEvent.getApplicationId(),
        appAddedEvent.getQueue(), appAddedEvent.getUser(),
        appAddedEvent.getIsAppRecovering());
      break;

After checking the event type, the code above calls the addApplication method:

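// Submit an application, identified by appID, queue name and user name, to the scheduler.
// Even if the user or queue is already over its limits the submission is still accepted; the app is simply not marked runnable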
protected synchronized void addApplication(ApplicationId applicationId,
      String queueName, String user, boolean isAppRecovering) {
    if (queueName == null || queueName.isEmpty()) {
      String message = "Reject application " + applicationId +
              " submitted by user " + user + " with an empty queue name.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, message));
      return;
    }
    // The queue name must not start or end with a period
    if (queueName.startsWith(".") || queueName.endsWith(".")) {
      String message = "Reject application " + applicationId
          + " submitted by user " + user + " with an illegal queue name "
          + queueName + ". "
          + "The queue name cannot start/end with period.";
      LOG.info(message);
      rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, message));
      return;
    }

    RMApp rmApp = rmContext.getRMApps().get(applicationId);
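    // Try to assign the app to the given queue; on success it goes into the queue container managed by QueueManager.
    // If the app is rejected, the method calls the appropriate event handler.
    // I debugged this with a test case, passing queue=default and user=chengc, and got root.chengc here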
    FSLeafQueue queue = assignToQueue(rmApp, queueName, user);
    if (queue == null) {
      return;
    }

    // Enforce ACLs
    UserGroupInformation userUgi = UserGroupInformation.createRemoteUser(user);

    if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi)
        && !queue.hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
      String msg = "User " + userUgi.getUserName() +
              " cannot submit applications to queue " + queue.getName();
      LOG.info(msg);
      rmContext.getDispatcher().getEventHandler()
          .handle(new RMAppRejectedEvent(applicationId, msg));
      return;
    }
  
    SchedulerApplication<FSAppAttempt> application =
        new SchedulerApplication<FSAppAttempt>(queue, user);
    // Put it into the ConcurrentMap<ApplicationId, SchedulerApplication<T>> applications
    // held by AbstractYarnScheduler, the parent class of FairScheduler
    applications.put(applicationId, application);
    
    // Update the app-submission metrics of this queue and its parent queues
    queue.getMetrics().submitApp(user);

    LOG.info("Accepted application " + applicationId + " from user: " + user
        + ", in queue: " + queueName + ", currently num of applications: "
        + applications.size());
    if (isAppRecovering) {
      if (LOG.isDebugEnabled()) {
        LOG.debug(applicationId + " is recovering. Skip notifying APP_ACCEPTED");
      }
    } else {
      // When done, submit an RMAppEventType.APP_ACCEPTED event to the AsyncDispatcher
      rmContext.getDispatcher().getEventHandler()
        .handle(new RMAppEvent(applicationId, RMAppEventType.APP_ACCEPTED));
    }
  }
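
The rejection rules at the top of addApplication are simple enough to restate as pure functions. QueueNameCheckSketch and its method names are made up for this illustration; the conditions themselves are copied from the listing above.

```java
// Hypothetical helper mirroring the admission checks; not part of FairScheduler.
class QueueNameCheckSketch {
    // Returns null when the name is acceptable, otherwise a short reason.
    static String rejectionReason(String queueName) {
        if (queueName == null || queueName.isEmpty()) {
            return "empty queue name";
        }
        if (queueName.startsWith(".") || queueName.endsWith(".")) {
            return "queue name cannot start/end with period";
        }
        return null;
    }

    // Submission is allowed when the user holds either of the two ACLs.
    static boolean maySubmit(boolean hasSubmitAcl, boolean hasAdministerAcl) {
        return hasSubmitAcl || hasAdministerAcl;
    }
}
```

Note that a period in the middle of the name, as in root.chengc, is fine; only a leading or trailing period is refused.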

6.2.15 RMAppImpl-createAndStartNewAttempt

After passing through AsyncDispatcher and ResourceManager.ApplicationEventDispatcher, we arrive in the inner class RMAppImpl.StartAppAttemptTransition:

// As the name suggests, this class is responsible for starting an App attempt
private static final class StartAppAttemptTransition extends RMAppTransition {
    @Override
    public void transition(RMAppImpl app, RMAppEvent event) {
      app.createAndStartNewAttempt(false);
    };
  }

Now look at the RMAppImpl.createAndStartNewAttempt method:

private void createAndStartNewAttempt(boolean transferStateFromPreviousAttempt) {
    // Create a new App attempt
    createNewAttempt();
    // Submit an RMAppStartAttemptEvent of type RMAppAttemptEventType.START to the AsyncDispatcher
    handler.handle(new RMAppStartAttemptEvent(currentAttempt.getAppAttemptId(),
      transferStateFromPreviousAttempt));
  }

Then look at the createNewAttempt method:

private void createNewAttempt() {
    // Generate an attemptId from the appId
    ApplicationAttemptId appAttemptId =
        ApplicationAttemptId.newInstance(applicationId, attempts.size() + 1);

    BlacklistManager currentAMBlacklist;
    // Blacklist for the AM's container (node level)
    if (currentAttempt != null) {
      currentAMBlacklist = currentAttempt.getAMBlacklist();
    } else {
      if (amBlacklistingEnabled) {
        currentAMBlacklist = new SimpleBlacklistManager(
            scheduler.getNumClusterNodes(), blacklistDisableThreshold);
      } else {
        currentAMBlacklist = new DisabledBlacklistManager();
      }
    }
    
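    // If (number of previously failed attempts (excluding preemption, hardware errors and NM resync) + 1)
    // equals the max-attempt limit, the newly created attempt may be the last one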
    RMAppAttempt attempt =
        new RMAppAttemptImpl(appAttemptId, rmContext, scheduler, masterService,
          submissionContext, conf,
          maxAppAttempts == (getNumFailedAppAttempts() + 1), amReq,
          currentAMBlacklist);
    attempts.put(appAttemptId, attempt);
    currentAttempt = attempt;
  }
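
Two small pieces of arithmetic in createNewAttempt deserve a closer look: the attempt number and the "may be the last attempt" flag. The sketch below isolates them. AttemptSketch is a made-up class, and the text form appattempt_<timestamp>_<4-digit app id>_<6-digit attempt id> is assumed from how such IDs usually appear in logs, not taken from the listing.

```java
import java.util.Locale;

// Hypothetical helper for the attempt bookkeeping; not the Hadoop RMAppImpl.
class AttemptSketch {
    // Attempt numbers are 1-based: the first attempt sees an empty attempts map.
    static int nextAttemptNumber(int existingAttempts) {
        return existingAttempts + 1;
    }

    // The new attempt may be the last one if one more failure would reach the limit.
    static boolean maybeLastAttempt(int maxAppAttempts, int failedAttempts) {
        return maxAppAttempts == failedAttempts + 1;
    }

    // Assumed text form of an attempt ID.
    static String attemptIdString(long clusterTimestamp, int appId, int attemptNumber) {
        return String.format(Locale.ROOT, "appattempt_%d_%04d_%06d",
            clusterTimestamp, appId, attemptNumber);
    }
}
```

With a limit of 2 attempts, for example, the first attempt is created with the flag false, and the attempt created after one counted failure gets true.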

After a series of steps, FairScheduler's handle method is triggered with an APP_ATTEMPT_ADDED event, and it then calls the addApplicationAttempt method:

// Submit an app attempt to the scheduler (FairScheduler)
protected synchronized void addApplicationAttempt(
      ApplicationAttemptId applicationAttemptId,
      boolean transferStateFromPreviousAttempt,
      boolean isAttemptRecovering) {
    // Note: the app was already stored when FairScheduler received the APP_ADDED event earlier
    SchedulerApplication<FSAppAttempt> application =
        applications.get(applicationAttemptId.getApplicationId());
    String user = application.getUser();
    FSLeafQueue queue = (FSLeafQueue) application.getQueue();
    // FSAppAttempt represents an app attempt from FairScheduler's point of view
    FSAppAttempt attempt =
        new FSAppAttempt(this, applicationAttemptId, user,
            queue, new ActiveUsersManager(getRootQueueMetrics()),
            rmContext);
    if (transferStateFromPreviousAttempt) {
      attempt.transferStateFromPreviousAttempt(application
          .getCurrentAppAttempt());
    }
    application.setCurrentAppAttempt(attempt);
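    // Check whether running this app would exceed the max-running-apps limits of the queue or the user
    boolean runnable = maxRunningEnforcer.canAppBeRunnable(queue, user);
    // Depending on runnable, put it into the FSLeafQueue's runnableApps or nonRunnableApps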
    queue.addApp(attempt, runnable);
    if (runnable) {
      // Increment the runnable-app count of the app's parent queues and the app count of the submitting user,
      // so that the max-running-apps limits can be enforced
      maxRunningEnforcer.trackRunnableApp(attempt);
    } else {
      // Non-runnable apps are tracked too, so they can become runnable once the max-running-apps limit allows it
      maxRunningEnforcer.trackNonRunnableApp(attempt);
    }
    // Record queue and user metrics
    queue.getMetrics().submitAppAttempt(user);

    LOG.info("Added Application Attempt " + applicationAttemptId
        + " to scheduler from user: " + user);

    if (isAttemptRecovering) {
      if (LOG.isDebugEnabled()) {
        LOG.debug(applicationAttemptId
            + " is recovering. Skipping notifying ATTEMPT_ADDED");
      }
    } else {
      // A familiar line: send an RMAppAttemptEventType.ATTEMPT_ADDED event to the AsyncDispatcher's GenericEventHandler.
      // Do not confuse it with the SchedulerEventType.APP_ATTEMPT_ADDED event that came from RMAppAttempt earlier
      rmContext.getDispatcher().getEventHandler().handle(
              new RMAppAttemptEvent(applicationAttemptId,
                      RMAppAttemptEventType.ATTEMPT_ADDED));
    }
  }
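
The runnable versus non-runnable decision can be sketched with two counters. MaxRunningSketch is a made-up, deliberately flat model: it knows a single queue limit and a single per-user limit, whereas the enforcer in the listing receives the actual queue object and may consult more than one level of the queue hierarchy, which is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, flat version of a max-running-apps check; not the FairScheduler enforcer.
class MaxRunningSketch {
    private final int queueMaxApps;
    private final int userMaxApps;
    private int queueRunnable = 0;
    private final Map<String, Integer> userRunnable = new HashMap<>();

    MaxRunningSketch(int queueMaxApps, int userMaxApps) {
        this.queueMaxApps = queueMaxApps;
        this.userMaxApps = userMaxApps;
    }

    // Runnable only while both the queue and the user are below their limits.
    boolean canAppBeRunnable(String user) {
        return queueRunnable < queueMaxApps
            && userRunnable.getOrDefault(user, 0) < userMaxApps;
    }

    void trackRunnableApp(String user) {
        queueRunnable++;
        userRunnable.merge(user, 1, Integer::sum);
    }

    // The attempt is accepted either way; the return value says whether it is runnable.
    boolean addAttempt(String user) {
        boolean runnable = canAppBeRunnable(user);
        if (runnable) {
            trackRunnableApp(user);
        }
        return runnable;
    }
}
```

This mirrors the point made in the addApplication comment: hitting a limit does not reject the submission, it only keeps the app out of the runnable set for now.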

6.2.16 ResourceManager.ApplicationAttemptEventDispatcher

After that, the event reaches the handle method of the inner class ApplicationAttemptEventDispatcher:

public static final class ApplicationAttemptEventDispatcher implements
      EventHandler<RMAppAttemptEvent> {
    private final RMContext rmContext;
    public ApplicationAttemptEventDispatcher(RMContext rmContext) {
      this.rmContext = rmContext;
    }
    @Override
    public void handle(RMAppAttemptEvent event) {
      ApplicationAttemptId appAttemptID = event.getApplicationAttemptId();
      ApplicationId appAttemptId = appAttemptID.getApplicationId();
      RMApp rmApp = this.rmContext.getRMApps().get(appAttemptId);
      if (rmApp != null) {
        RMAppAttempt rmAppAttempt = rmApp.getRMAppAttempt(appAttemptID);
        if (rmAppAttempt != null) {
          try {
            // Hand the ATTEMPT_ADDED event to RMAppAttemptImpl
            rmAppAttempt.handle(event);
          } catch (Throwable t) {
            LOG.error("Error in handling event type " + event.getType()
                + " for applicationAttempt " + appAttemptId, t);
          }
        }
      }
    }
  }

6.2.17 RMAppAttemptImpl

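On receiving the event, RMAppAttemptImpl calls stateMachine.doTransition. At this point the event type is RMAppAttemptEventType.ATTEMPT_ADDED and the state is RMAppAttemptState.SUBMITTED; the transition ends up executing the following code: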

public static final class ScheduleTransition implements
      MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
    @Override
    public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
        RMAppAttemptEvent event) {
      ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
      // Only an AM managed by the RM gets resources allocated and is launched here
      if (!subCtx.getUnmanagedAM()) {
        // Need reset #containers before create new attempt, because this request
        // will be passed to scheduler, and scheduler will deduct the number after
        // AM container allocated

        // Note: in this version of the code the values below are hard-coded; later versions allow them to be changed
        // Set the number of containers required
        appAttempt.amReq.setNumContainers(1);
        // Set the priority
        appAttempt.amReq.setPriority(AM_CONTAINER_PRIORITY);
        // Set the resource name
        appAttempt.amReq.setResourceName(ResourceRequest.ANY);
        // RelaxLocality is explained in Appendix 1.1
        appAttempt.amReq.setRelaxLocality(true);

        appAttempt.getAMBlacklist().refreshNodeHostCount(
            appAttempt.scheduler.getNumClusterNodes());
        // Node blacklist held by the App
        BlacklistUpdates amBlacklist = appAttempt.getAMBlacklist()
            .getBlacklistUpdates();
        if (LOG.isDebugEnabled()) {
          LOG.debug("Using blacklist for AM: additions(" +
              amBlacklist.getAdditions() + ") and removals(" +
              amBlacklist.getRemovals() + ")");
        }
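        // The AM resource has already been checked, so the request can be submitted directly.
        // This is the crucial call: it asks the scheduler to start allocating resources.
        // Through it the AM updates its resource requirements and may release containers it no longer needs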
        Allocation amContainerAllocation =
            appAttempt.scheduler.allocate(
                appAttempt.applicationAttemptId,
                Collections.singletonList(appAttempt.amReq),
                EMPTY_CONTAINER_RELEASE_LIST,
                amBlacklist.getAdditions(),
                amBlacklist.getRemovals());
        if (amContainerAllocation != null
            && amContainerAllocation.getContainers() != null) {
          assert (amContainerAllocation.getContainers().size() == 0);
        }
        // The resource request has been registered; return the SCHEDULED state
        return RMAppAttemptState.SCHEDULED;
      } else {
        // save state and then go to LAUNCHED state
        appAttempt.storeAttempt();
        return RMAppAttemptState.LAUNCHED_UNMANAGED_SAVING;
      }
    }
  }

After this runs, the state is RMAppAttemptState.SCHEDULED. For more on the RelaxLocality mentioned in the code above, see Appendix 1.1.
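
ScheduleTransition is a MultipleArcTransition, which differs from the single-arc case seen earlier in that the hook itself returns the next state. A minimal sketch of that difference follows; MultiArcSketch, its Attempt class and the requestedContainers field are made-up names that model only the branch on the unmanaged-AM flag.

```java
// Hypothetical model of a multiple-arc transition; not the Hadoop RMAppAttemptImpl.
class MultiArcSketch {
    enum AttemptState { SUBMITTED, SCHEDULED, LAUNCHED_UNMANAGED_SAVING }

    // Only the flag that drives the branch, plus one observable side effect, is modelled.
    static final class Attempt {
        final boolean unmanagedAM;
        int requestedContainers = 0;
        Attempt(boolean unmanagedAM) {
            this.unmanagedAM = unmanagedAM;
        }
    }

    // The transition decides the post state at run time instead of having it fixed up front.
    static AttemptState schedule(Attempt attempt) {
        if (!attempt.unmanagedAM) {
            attempt.requestedContainers = 1; // exactly one container is asked for the AM
            return AttemptState.SCHEDULED;
        }
        return AttemptState.LAUNCHED_UNMANAGED_SAVING;
    }
}
```

In the state machine table such an arc is registered with the set of states it is allowed to return, so the table still documents every reachable state.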

6.2.18 FairScheduler-allocate

@Override
  public Allocation allocate(ApplicationAttemptId appAttemptId,
      List<ResourceRequest> ask, List<ContainerId> release,
      List<String> blacklistAdditions, List<String> blacklistRemovals) {

    // Make sure the app exists
    FSAppAttempt application = getSchedulerApp(appAttemptId);
    if (application == null) {
      LOG.info("Calling allocate on removed " +
          "or non existant application " + appAttemptId);
      return EMPTY_ALLOCATION;
    }

    // Sanity-check (normalize) the resource requests
    SchedulerUtils.normalizeRequests(ask, DOMINANT_RESOURCE_CALCULATOR,
        clusterResource, minimumAllocation, getMaximumResourceCapability(),
        incrAllocation);

    // Record container allocation start time
    application.recordContainerRequestTime(getClock().getTime());

    // Release containers
    releaseContainers(release, application);

    synchronized (application) {
      // ask holds the requested resources; check whether it is empty
      if (!ask.isEmpty()) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("allocate: pre-update" +
              " applicationAttemptId=" + appAttemptId +
              " application=" + application.getApplicationId());
        }
        // Print the resource request details when debugging
        application.showRequests();

        // Update the app's outstanding container resource requests in AppSchedulingInfo
        application.updateResourceRequests(ask);

        application.showRequests();
      }

      if (LOG.isDebugEnabled()) {
        LOG.debug("allocate: post-update" +
            " applicationAttemptId=" + appAttemptId +
            " #ask=" + ask.size() +
            " reservation= " + application.getCurrentReservation());

        LOG.debug("Preempting " + application.getPreemptionContainers().size()
            + " container(s)");
      }

      Set<ContainerId> preemptionContainerIds = new HashSet<ContainerId>();
      for (RMContainer container : application.getPreemptionContainers()) {
        preemptionContainerIds.add(container.getContainerId());
      }
      // Check whether the app is still waiting for its AM container
      if (application.isWaitingForAMContainer(application.getApplicationId())) {
        // Reaching here means a Container is being allocated for the AM, so update the blacklist used for AM containers
        application.updateAMBlacklist(
            blacklistAdditions, blacklistRemovals);
      } else {
        // Update the blacklist used for non-AM containers
        application.updateBlacklist(blacklistAdditions, blacklistRemovals);
      }

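      // Pull the app's newly allocated containers together with the NMTokens of their NMs.
      // The container token issued by the RM is verified by the NM when it starts the container; it is opaque to the app and managed by the framework.
      // The NMToken authenticates communication with the NM.
      // The RM generates the NMToken when the AM requests resources; verification happens on the NM side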
      ContainersAndNMTokensAllocation allocation =
          application.pullNewlyAllocatedContainersAndNMTokens();

      // Record the container allocation time
      if (!(allocation.getContainerList().isEmpty())) {
        application.recordContainerAllocationTime(getClock().getTime());
      }

      // Finally return an Allocation object
      return new Allocation(allocation.getContainerList(),
        application.getHeadroom(), preemptionContainerIds, null, null,
        allocation.getNMTokenList());
    }
  }
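
One thing worth noticing in allocate is that it does not hand out containers on the spot: it records the ask and returns whatever has been allocated since the previous call, which is why ScheduleTransition asserts that the first result is empty. The sketch below models that ask-then-pull pattern. AskPullSketch is a made-up class, and two points are assumptions rather than facts from the listing: that a new ask replaces the outstanding count, and that the actual assignment is triggered by a node reporting in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-attempt bookkeeping; not the Hadoop FSAppAttempt.
class AskPullSketch {
    private int pendingContainers = 0;
    private int nextContainerId = 1;
    private final List<String> newlyAllocated = new ArrayList<>();

    // Records the ask (null means "no new ask") and returns containers allocated since the last call.
    synchronized List<String> allocate(Integer ask) {
        if (ask != null) {
            pendingContainers = ask; // assumption: an ask replaces the outstanding count
        }
        List<String> result = new ArrayList<>(newlyAllocated);
        newlyAllocated.clear();
        return result;
    }

    // Assumed trigger: a node reports in and one pending container, if any, is placed on it.
    synchronized void nodeHeartbeat(String node) {
        if (pendingContainers > 0) {
            pendingContainers--;
            newlyAllocated.add("container_" + (nextContainerId++) + "@" + node);
        }
    }
}
```

Seen this way, the call in ScheduleTransition only registers the AM's request; the container itself arrives later, through a subsequent pull.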

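6.3 Summary

This chapter analyzed how the server side handles the submission of an MR job. It is actually fairly simple and follows the same fixed pattern throughout: an event is handed to a dispatcher, a state machine picks the transition, and the transition emits the next event. That completes our analysis of the job submission process.

In the next chapter we will cover ShutdownHookManager, which has come up several times.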