Table of Contents

  • 1.Usage
  • 2.Analysis
  • 2.1.Configurable resources
  • 2.2.ResourceProfile
  • 2.3.Conversion chain
  • 2.3.1.StreamGraphGenerator
  • 2.3.2.StreamGraph
  • 2.3.3.JobVertex
  • 2.3.4.ExecutionJobVertex
  • 3.Processing
  • 3.1.Implementation class
  • 3.2.processResourceRequirements
  • 3.3.checkResourceRequirementsWithDelay
  • 3.4.checkResourceRequirements
  • 3.4.1.Getting the missing resources
  • 3.4.2.Resource matching
  • 3.4.3.tryFulfilledRequirementWithResource
  • 3.4.4.tryFulfillRequirementsForJobWithPendingResources
  • 3.4.5.resultBuilder.build()
  • 3.5.allocateSlotsAccordingTo
  • 3.6.allocateTaskManagersAccordingTo
  • 3.7.Failure handling
  • 4.Trigger flow


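1.Usage

Class: flink-core\src\main\java\org\apache\flink\api\common\operators\SlotSharingGroup.java
Typical usage in a job is shown below; note the call to registerSlotSharingGroup.
Fine-grained resource management must also be enabled with the switch cluster.fine-grained-resource-management.enabled.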

final SlotSharingGroup ssg =
SlotSharingGroup.newBuilder("ssg1").setCpuCores(1).setTaskHeapMemoryMB(100).build();

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.registerSlotSharingGroup(ssg);

final DataStream<Integer> source = env.fromElements(1).slotSharingGroup("ssg1");
...

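2.Analysis

2.1.Configurable resources

The configurable parameters are mainly the CPU and three memory settings (task heap, task off-heap and managed memory), plus extended resources such as GPUs. The constructor stores them as follows.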

this.name = checkNotNull(name);
this.cpuCores = checkNotNull(cpuCores);
this.taskHeapMemory = checkNotNull(taskHeapMemory);
this.taskOffHeapMemory = checkNotNull(taskOffHeapMemory);
this.managedMemory = checkNotNull(managedMemory);
this.externalResources.putAll(checkNotNull(extendedResources));

2.2.ResourceProfile

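Client code works with SlotSharingGroup, while the cluster side works with ResourceProfile. The registerSlotSharingGroup call used in the example above is implemented as follows and ends up converting the group into a ResourceProfile.
The MemorySize.ZERO argument is the networkMemory parameter. Presumably this means network memory is left unset here, since network resources should be shared within a TaskManager.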

@PublicEvolving
public StreamExecutionEnvironment registerSlotSharingGroup(SlotSharingGroup slotSharingGroup) {
    final ResourceSpec resourceSpec =
            SlotSharingGroupUtils.extractResourceSpec(slotSharingGroup);
    if (!resourceSpec.equals(ResourceSpec.UNKNOWN)) {
        this.slotSharingGroupResources.put(
                slotSharingGroup.getName(),
                ResourceProfile.fromResourceSpec(
                        SlotSharingGroupUtils.extractResourceSpec(slotSharingGroup),
                        MemorySize.ZERO));
    }
    return this;
}

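2.3.Conversion chain

registerSlotSharingGroup records the resources in the StreamExecutionEnvironment. From there they are handed along a chain of objects until they reach the cluster side.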

2.3.1.StreamGraphGenerator

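User code first generates a StreamGraph, which is done by StreamGraphGenerator. The resource information is passed in when the StreamGraphGenerator is created.
slotSharingGroupResources is the map of resource profiles filled in by registerSlotSharingGroup.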

private StreamGraphGenerator getStreamGraphGenerator(List<Transformation<?>> transformations) {
    if (transformations.size() <= 0) {
        throw new IllegalStateException(
                "No operators defined in streaming topology. Cannot execute.");
    }

    return new StreamGraphGenerator(transformations, config, checkpointCfg, configuration)
            .setStateBackend(defaultStateBackend)
            .setChangelogStateBackendEnabled(changelogStateBackendEnabled)
            .setSavepointDir(defaultSavepointDirectory)
            .setChaining(isChainingEnabled)
            .setUserArtifacts(cacheFile)
            .setTimeCharacteristic(timeCharacteristic)
            .setDefaultBufferTimeout(bufferTimeout)
            .setSlotSharingGroupResource(slotSharingGroupResources);
}

2.3.2.StreamGraph

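StreamGraph is the first of Flink's three graphs (StreamGraph, JobGraph, ExecutionGraph). It is usually generated on the client and is the basis from which the job run by the cluster is built. When StreamGraphGenerator.generate is called, the resource information is passed into the StreamGraph.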

streamGraph.setSlotSharingGroupResource(slotSharingGroupResources);

2.3.3.JobVertex

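JobVertex is a JobGraph concept: a node of the JobGraph that represents one or more StreamNodes chained together. Its inputs are JobEdges and its outputs are IntermediateDataSets. It is eventually expanded into Tasks according to its parallelism, so one JobVertex roughly corresponds to one group of Tasks.
The resource information is passed to the JobVertex in StreamingJobGraphGenerator.setSlotSharing, which is called from createJobGraph.
DEFAULT_SLOT_SHARING_GROUP is the default setting, under which a TaskManager's resources are divided evenly among its slots.
SlotSharingGroup controls slot sharing: tasks in the same SlotSharingGroup may run in the same slot.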

if (slotSharingGroupKey.equals(StreamGraphGenerator.DEFAULT_SLOT_SHARING_GROUP)) {
    // fallback to the region slot sharing group by default
    effectiveSlotSharingGroup =
            checkNotNull(vertexRegionSlotSharingGroups.get(vertex.getID()));
} else {
    effectiveSlotSharingGroup =
            specifiedSlotSharingGroups.computeIfAbsent(
                    slotSharingGroupKey,
                    k -> {
                        SlotSharingGroup ssg = new SlotSharingGroup();
                        streamGraph
                                .getSlotSharingGroupResource(k)
                                .ifPresent(ssg::setResourceProfile);
                        return ssg;
                    });
}

vertex.setSlotSharingGroup(effectiveSlotSharingGroup);
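
The lookup above relies on Map.computeIfAbsent, so every vertex that uses the same group name receives the same group instance, and a registered profile is attached only if one exists. A minimal stand-alone sketch of that pattern follows. The Group class, the String-valued profile and the resolve method are illustrative stand-ins, not Flink types.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; Group and resolve are hypothetical, not Flink classes. */
class SlotSharingLookupSketch {
    /** Stand-in for a slot sharing group; the profile is a plain string here. */
    static final class Group {
        String resourceProfile = "UNKNOWN";
    }

    /** Returns the group for the key, creating it once and attaching a registered profile if present. */
    static Group resolve(
            Map<String, Group> groups, Map<String, String> registeredProfiles, String key) {
        return groups.computeIfAbsent(
                key,
                k -> {
                    Group group = new Group();
                    Optional.ofNullable(registeredProfiles.get(k))
                            .ifPresent(profile -> group.resourceProfile = profile);
                    return group;
                });
    }

    public static void main(String[] args) {
        Map<String, Group> groups = new HashMap<>();
        Map<String, String> registered = new HashMap<>();
        registered.put("ssg1", "cpu=1,heap=100m");

        Group first = resolve(groups, registered, "ssg1");
        Group second = resolve(groups, registered, "ssg1");
        System.out.println(first == second);       // both lookups share one instance
        System.out.println(first.resourceProfile); // the registered profile was attached
        System.out.println(resolve(groups, registered, "other").resourceProfile);
    }
}
```

A name with no registered profile still gets a group, just without a specific profile, which matches the ifPresent call in the original snippet.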

2.3.4.ExecutionJobVertex

ExecutionJobVertex is the ExecutionGraph counterpart of JobVertex, and the SlotSharingGroup is carried over to it accordingly.

// take the sharing group
this.slotSharingGroup = checkNotNull(jobVertex.getSlotSharingGroup());
this.coLocationGroup = jobVertex.getCoLocationGroup();

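3.Processing

One open question: resource allocation always appears to be triggered from the job side, so how does this differ from declarative resource management?
According to the description of declarative resource management, the old mechanism allocated per task, and the declarative approach improved on this by allocating for the job as a whole. Fine-grained allocation appears to achieve the same improvement, seemingly allocating per slot sharing group.

3.1.Implementation class

Setting cluster.fine-grained-resource-management.enabled to true turns on fine-grained resource allocation.
Fine-grained resource allocation is implemented by FineGrainedSlotManager.
Declarative management and fine-grained management apparently do not share an implementation. If so, would fine-grained management be incompatible with custom schedulers?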

if (configuration.isEnableFineGrainedResourceManagement()) {
    return new FineGrainedSlotManager(
            scheduledExecutor,
            slotManagerConfiguration,
            slotManagerMetricGroup,
            new DefaultResourceTracker(),
            new FineGrainedTaskManagerTracker(),
            new DefaultSlotStatusSyncer(
                    slotManagerConfiguration.getTaskManagerRequestTimeout()),
            new DefaultResourceAllocationStrategy(
                    SlotManagerUtils.generateTaskManagerTotalResourceProfile(
                            slotManagerConfiguration.getDefaultWorkerResourceSpec()),
                    slotManagerConfiguration.getNumSlotsPerWorker()),
            Time.milliseconds(REQUIREMENTS_CHECK_DELAY_MS));
} else {
    return new DeclarativeSlotManager(
            scheduledExecutor,
            slotManagerConfiguration,
            slotManagerMetricGroup,
            new DefaultResourceTracker(),
            new DefaultSlotTracker());
}

3.2.processResourceRequirements

The entry point should be processResourceRequirements of the implementation class. Its main steps are shown below.
The targetAddress of resourceRequirements is the address of the JobMaster.

if (resourceRequirements.getResourceRequirements().isEmpty()) {
    jobMasterTargetAddresses.remove(resourceRequirements.getJobId());
} else {
    jobMasterTargetAddresses.put(
            resourceRequirements.getJobId(), resourceRequirements.getTargetAddress());
}
resourceTracker.notifyResourceRequirements(
        resourceRequirements.getJobId(), resourceRequirements.getResourceRequirements());
checkResourceRequirementsWithDelay();

3.3.checkResourceRequirementsWithDelay

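To reduce load and improve performance, resource requirements are checked after a delay, so that a single check can cover several changes. The delay is currently fixed at 50 milliseconds and is not configurable.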

private void checkResourceRequirementsWithDelay() {
    if (requirementsCheckFuture == null || requirementsCheckFuture.isDone()) {
        requirementsCheckFuture = new CompletableFuture<>();
        scheduledExecutor.schedule(
                () ->
                        mainThreadExecutor.execute(
                                () -> {
                                    checkResourceRequirements();
                                    Preconditions.checkNotNull(requirementsCheckFuture)
                                            .complete(null);
                                }),
                requirementsCheckDelay.toMilliseconds(),
                TimeUnit.MILLISECONDS);
    }
}
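
The coalescing behaviour described above can be tried out in isolation with a small sketch that uses only the JDK. DelayedCheckSketch, requestCheck and countChecksForBurst are hypothetical names, and a counter stands in for checkResourceRequirements(). The sketch also omits the hand-off to a main thread executor that the original performs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; this is not Flink's FineGrainedSlotManager. */
class DelayedCheckSketch {
    private final ScheduledExecutorService scheduler;
    private final long delayMs;
    private final AtomicInteger checksRun = new AtomicInteger();
    // null or completed means no check is currently scheduled
    private CompletableFuture<Void> pendingCheck;

    DelayedCheckSketch(ScheduledExecutorService scheduler, long delayMs) {
        this.scheduler = scheduler;
        this.delayMs = delayMs;
    }

    /** Schedules one check after the delay unless a check is already waiting. */
    synchronized CompletableFuture<Void> requestCheck() {
        if (pendingCheck == null || pendingCheck.isDone()) {
            CompletableFuture<Void> future = new CompletableFuture<>();
            pendingCheck = future;
            scheduler.schedule(
                    () -> {
                        checksRun.incrementAndGet(); // stands in for checkResourceRequirements()
                        future.complete(null);
                    },
                    delayMs,
                    TimeUnit.MILLISECONDS);
        }
        return pendingCheck;
    }

    /** Fires several requests in quick succession and returns how many checks actually ran. */
    static int countChecksForBurst(int requests, long delayMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            DelayedCheckSketch sketch = new DelayedCheckSketch(scheduler, delayMs);
            CompletableFuture<Void> last = null;
            for (int i = 0; i < requests; i++) {
                last = sketch.requestCheck();
            }
            if (last != null) {
                last.join(); // wait for the scheduled check to finish
            }
            return sketch.checksRun.get();
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Ten requests arriving within the delay window should share a single check.
        System.out.println(countChecksForBurst(10, 200));
    }
}
```

Requests that arrive while a check is pending add no extra work, which is why a burst of requirement updates costs only one pass.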

3.4.checkResourceRequirements

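This is the method that actually checks the resources and carries out the allocation.

3.4.1.Getting the missing resources

processResourceRequirements records the resource requests, and later steps of this method record what has already been allocated. This step uses both records to work out which resources each job is still missing.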

public Collection<ResourceRequirement> getMissingResources() {
    final Collection<ResourceRequirement> missingResources = new ArrayList<>();
    for (Map.Entry<ResourceProfile, Integer> requirement :
            resourceRequirements.getResourcesWithCount()) {
        ResourceProfile requirementProfile = requirement.getKey();

        int numRequiredResources = requirement.getValue();
        int numAcquiredResources =
                resourceToRequirementMapping.getNumFulfillingResources(requirementProfile);
        if (numAcquiredResources < numRequiredResources) {
            missingResources.add(
                    ResourceRequirement.create(
                            requirementProfile, numRequiredResources - numAcquiredResources));
        }
    }
    return missingResources;
}
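
The arithmetic in getMissingResources is a per-profile difference between what was requested and what has been acquired. A minimal sketch with plain maps makes this concrete. Profiles are reduced to names here, and MissingResourcesSketch and missing are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; profiles are reduced to plain names. */
class MissingResourcesSketch {
    /** Returns, per profile, how many slots are required but not yet acquired. */
    static Map<String, Integer> missing(
            Map<String, Integer> required, Map<String, Integer> acquired) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : required.entrySet()) {
            int numRequired = entry.getValue();
            int numAcquired = acquired.getOrDefault(entry.getKey(), 0);
            if (numAcquired < numRequired) {
                result.put(entry.getKey(), numRequired - numAcquired);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> required = new LinkedHashMap<>();
        required.put("small", 4);
        required.put("large", 2);
        Map<String, Integer> acquired = new LinkedHashMap<>();
        acquired.put("small", 1);
        acquired.put("large", 2);
        System.out.println(missing(required, acquired)); // prints {small=3}
    }
}
```

Profiles that are fully satisfied do not appear in the result, so an empty map means nothing is left to allocate.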

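3.4.2.Resource matching

This step matches the missing resources against what is available, without performing the actual allocation. It is handled in tryFulfillRequirements of ResourceAllocationStrategy.
Two kinds of resources are defined here: registeredResources and pendingResources. registeredResources should correspond to all registered TaskManagers. The entries in pendingResources are first created later in the flow, in tryFulfillRequirementsForJobWithPendingResources. The method is as follows.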

public ResourceAllocationResult tryFulfillRequirements(
        Map<JobID, Collection<ResourceRequirement>> missingResources,
        TaskManagerResourceInfoProvider taskManagerResourceInfoProvider) {
    final ResourceAllocationResult.Builder resultBuilder = ResourceAllocationResult.builder();

    final List<InternalResourceInfo> registeredResources =
            getRegisteredResources(taskManagerResourceInfoProvider, resultBuilder);
    final List<InternalResourceInfo> pendingResources =
            getPendingResources(taskManagerResourceInfoProvider, resultBuilder);

    for (Map.Entry<JobID, Collection<ResourceRequirement>> resourceRequirements :
            missingResources.entrySet()) {
        final JobID jobId = resourceRequirements.getKey();

        final Collection<ResourceRequirement> unfulfilledJobRequirements =
                tryFulfillRequirementsForJobWithResources(
                        jobId, resourceRequirements.getValue(), registeredResources);

        if (!unfulfilledJobRequirements.isEmpty()) {
            tryFulfillRequirementsForJobWithPendingResources(
                    jobId, unfulfilledJobRequirements, pendingResources, resultBuilder);
        }
    }
    return resultBuilder.build();
}

3.4.3.tryFulfilledRequirementWithResource

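This is where tryFulfillRequirementsForJobWithResources finally ends up and where requirements are assigned to TaskManagers. The assignment is still part of the matching result; the slots themselves are requested in allocateSlotsAccordingTo (section 3.5). Assignment goes TaskManager by TaskManager: as many slots as possible are placed on one TaskManager, and only when it cannot take more does the loop move on to the next.
The returned numUnfulfilled is the number of requested slots that could not be satisfied in this pass.
Note that this method is called once per requiredResource, in a loop over the requirements.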

private static int tryFulfilledRequirementWithResource(
        List<InternalResourceInfo> internalResource,
        int numUnfulfilled,
        ResourceProfile requiredResource,
        JobID jobId) {
    final Iterator<InternalResourceInfo> internalResourceInfoItr = internalResource.iterator();
    while (numUnfulfilled > 0 && internalResourceInfoItr.hasNext()) {
        final InternalResourceInfo currentTaskManager = internalResourceInfoItr.next();
        while (numUnfulfilled > 0
                && currentTaskManager.tryAllocateSlotForJob(jobId, requiredResource)) {
            numUnfulfilled--;
        }
        if (currentTaskManager.availableProfile.equals(ResourceProfile.ZERO)) {
            internalResourceInfoItr.remove();
        }
    }
    return numUnfulfilled;
}
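
The fill order is easier to see with concrete numbers. The sketch below reproduces the same two nested loops on a simplified model in which a TaskManager's free capacity is just whole CPU cores and memory in MB. GreedyFillSketch, Capacity and fulfill are hypothetical names, and the real ResourceProfile has more dimensions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative sketch only; Capacity is a simplified stand-in for InternalResourceInfo. */
class GreedyFillSketch {
    /** Free capacity of one TaskManager: whole CPU cores and memory in MB. */
    static final class Capacity {
        int cpu;
        int memMb;

        Capacity(int cpu, int memMb) {
            this.cpu = cpu;
            this.memMb = memMb;
        }

        /** Reserves one slot of the given size if it fits; returns whether it did. */
        boolean tryAllocate(int reqCpu, int reqMemMb) {
            if (reqCpu > cpu || reqMemMb > memMb) {
                return false;
            }
            cpu -= reqCpu;
            memMb -= reqMemMb;
            return true;
        }

        boolean isExhausted() {
            return cpu == 0 && memMb == 0;
        }
    }

    /** Fills TaskManagers one after another and returns the number of slots left unfulfilled. */
    static int fulfill(List<Capacity> taskManagers, int numUnfulfilled, int reqCpu, int reqMemMb) {
        Iterator<Capacity> it = taskManagers.iterator();
        while (numUnfulfilled > 0 && it.hasNext()) {
            Capacity current = it.next();
            while (numUnfulfilled > 0 && current.tryAllocate(reqCpu, reqMemMb)) {
                numUnfulfilled--;
            }
            if (current.isExhausted()) {
                it.remove(); // nothing left here, so later requirements can skip it
            }
        }
        return numUnfulfilled;
    }

    public static void main(String[] args) {
        List<Capacity> taskManagers = new ArrayList<>();
        taskManagers.add(new Capacity(2, 1024));
        taskManagers.add(new Capacity(2, 1024));
        // Five slots of 1 core / 512 MB: two fit on each TaskManager, one is left over.
        System.out.println(fulfill(taskManagers, 5, 1, 512)); // prints 1
        System.out.println(taskManagers.size());              // prints 0
    }
}
```

Because exhausted TaskManagers are removed from the shared list, the calls for later requirements iterate over fewer candidates.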

3.4.4.tryFulfillRequirementsForJobWithPendingResources

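This pass handles the requirements that are still unsatisfied after the previous section. The approach is the same, except that the resources come from PendingTaskManagers and there is an extra follow-up step, which creates the PendingTaskManager mentioned earlier.
PendingTaskManager is designed to support dynamic allocation of pending task executors (this is the official wording; interpretation still to be done).
Creating a PendingTaskManager depends on totalResourceProfile, which comes from createDefaultWorkerResourceSpec of WorkerResourceSpecFactory. Different resource managers have different implementations; in standalone mode the implementation returns ZERO, which means PendingTaskManager is never used there.
The implementation in ArbitraryWorkerResourceSpecFactory is as follows.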

public WorkerResourceSpec createDefaultWorkerResourceSpec(Configuration configuration) {
    return WorkerResourceSpec.ZERO;
}

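The implementation used by YarnWorkerResourceSpecFactory is shown below; it builds a resource specification from the memory and CPU configuration. Read this way, PendingTaskManager is a virtual concept for resource managers that can request resources dynamically, so that scaling out can proceed quickly when the actual allocation happens.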

protected WorkerResourceSpec workerResourceSpecFromConfigAndCpu(
        Configuration configuration, CPUResource cpuResource) {
    final TaskExecutorProcessSpec taskExecutorProcessSpec =
            TaskExecutorProcessUtils.newProcessSpecBuilder(configuration)
                    .withCpuCores(cpuResource)
                    .build();
    return WorkerResourceSpec.fromTaskExecutorProcessSpec(taskExecutorProcessSpec);
}

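Note that a whole TaskManager is requested here, sized according to the configured TaskManager resources, and the requirements are then matched against that amount. If resources remain after matching, the remainder is put into the list of available resources. Everything ends up in resultBuilder, which is used by the caller.
Note that there are two kinds of additions: addPendingTaskManagerAllocate and addAllocationOnPendingResource.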

while (numUnfulfilled > 0
        && canFulfillRequirement(effectiveProfile, remainResource)) {
    numUnfulfilled--;
    resultBuilder.addAllocationOnPendingResource(
            jobId,
            newPendingTaskManager.getPendingTaskManagerId(),
            effectiveProfile);
    remainResource = remainResource.subtract(effectiveProfile);
}
if (!remainResource.equals(ResourceProfile.ZERO)) {
    availableResources.add(
            new InternalResourceInfo(
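
To get a feel for how many PendingTaskManagers this loop produces, the following sketch repeats the same idea on a simplified model: open one worker of the default size, place as many slots on it as fit, and repeat while requirements remain. All names are hypothetical, capacities are whole cores and MB, and returning -1 for a request that cannot fit into a single worker is a convention of this sketch only.

```java
/** Illustrative sketch only; a simplified model of sizing pending workers. */
class PendingTaskManagerSketch {
    /**
     * Returns how many pending workers of size (tmCpu, tmMemMb) are needed to host
     * numUnfulfilled slots of size (reqCpu, reqMemMb), or -1 if one slot can never fit.
     */
    static int pendingTaskManagersNeeded(
            int numUnfulfilled, int tmCpu, int tmMemMb, int reqCpu, int reqMemMb) {
        if (reqCpu > tmCpu || reqMemMb > tmMemMb) {
            return -1; // sketch convention: the default worker size is too small
        }
        int pending = 0;
        while (numUnfulfilled > 0) {
            pending++; // open one more pending worker
            int remainCpu = tmCpu;
            int remainMemMb = tmMemMb;
            while (numUnfulfilled > 0 && reqCpu <= remainCpu && reqMemMb <= remainMemMb) {
                numUnfulfilled--;
                remainCpu -= reqCpu;
                remainMemMb -= reqMemMb;
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        // Each 4-core / 4096 MB worker hosts four 1-core / 1024 MB slots.
        System.out.println(pendingTaskManagersNeeded(5, 4, 4096, 1, 1024)); // prints 2
        // A zero-sized default worker, as in standalone mode, can host nothing.
        System.out.println(pendingTaskManagersNeeded(1, 0, 0, 1, 100));     // prints -1
    }
}
```

The second call mirrors the standalone case described earlier: with a ZERO default worker spec no pending worker can ever satisfy a request.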

3.4.5.resultBuilder.build()

This step packs the allocation results described above into an object for use by the next step.

public ResourceAllocationResult build() {
    return new ResourceAllocationResult(
            unfulfillableJobs,
            allocationsOnRegisteredResources,
            pendingTaskManagersToAllocate,
            allocationsOnPendingResources);
}

3.5.allocateSlotsAccordingTo

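This step performs the actual allocation according to the result of the previous step; the allocation itself happens in allocateSlot of DefaultSlotStatusSyncer. Only resources of TaskManagers that already exist are covered, so PendingTaskManagers are not included.
The process is to look up the TaskManager's gateway and call gateway.requestSlot to request the slot, after which the TaskExecutor allocates the resources and creates the job-related information.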

final Optional<TaskManagerInfo> taskManager =
        taskManagerTracker.getRegisteredTaskManager(instanceId);
final TaskExecutorGateway gateway =
        taskManager.get().getTaskExecutorConnection().getTaskExecutorGateway();
CompletableFuture<Acknowledge> requestFuture =
        gateway.requestSlot(

3.6.allocateTaskManagersAccordingTo

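This step handles the allocation for PendingTaskManagers.
It first checks whether the total resources would exceed the upper limit. The limit covers CPU and memory, controlled by slotmanager.max-total-resource.cpu and slotmanager.max-total-resource.memory respectively.
If these are not configured, the limit is derived from slotmanager.number-of-slots.max. This derivation is questionable: the configured memory and other resources apply to a single process, so does a limit derived from slotmanager.number-of-slots.max still mean the same thing?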

if (isMaxTotalResourceExceededAfterAdding(pendingTaskManager.getTotalResourceProfile())) {
    LOG.info(
            "Could not allocate {}. Max total resource limitation <{}, {}> is reached.",
            pendingTaskManager,
            maxTotalCpu,
            maxTotalMem.toHumanReadableString());
    return false;
}
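
The check can be pictured as a comparison on two dimensions. The sketch below assumes that the resources already tracked plus the new TaskManager are compared against the two limits, and that exceeding either one is enough to reject the request. That is a reading of the log message above rather than something taken from the source, and all names are hypothetical.

```java
/** Illustrative sketch only; assumes a simple sum-versus-limit comparison. */
class MaxTotalResourceSketch {
    /** Returns true if adding the new worker would push CPU or memory past its limit. */
    static boolean exceedsAfterAdding(
            double currentCpu,
            long currentMemMb,
            double addedCpu,
            long addedMemMb,
            double maxCpu,
            long maxMemMb) {
        return currentCpu + addedCpu > maxCpu || currentMemMb + addedMemMb > maxMemMb;
    }

    public static void main(String[] args) {
        // 6 cores / 6144 MB in use, limit 8 cores / 8192 MB.
        System.out.println(exceedsAfterAdding(6, 6144, 2, 2048, 8, 8192)); // prints false
        System.out.println(exceedsAfterAdding(6, 6144, 4, 2048, 8, 8192)); // prints true
    }
}
```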

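Next, new resources are requested. This is only supported by resource managers that can add TaskManagers dynamically, such as YARN and Kubernetes. Once the check passes, a new TaskManager is requested in requestNewWorker of ActiveResourceManager.
After the request is made, the PendingTaskManager is added to the taskManagerTracker, from which the earlier steps read it.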

taskManagerTracker.addPendingTaskManager(pendingTaskManager);

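3.7.Failure handling

Failures are handled afterwards, mainly by updating the tracked lists.

4.Trigger flow

The question here is which action triggers resource allocation. The allocation flow is initiated by the JobMaster: tracing the call chain of processResourceRequirements leads back to establishResourceManagerConnection in JobMaster.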

slotPoolService.connectToResourceManager(resourceManagerGateway);

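establishResourceManagerConnection is invoked by the registration that a RegisteredRpcConnection triggers when it starts. ResourceManagerConnection extends RegisteredRpcConnection, so in practice a ResourceManagerConnection is created and triggers the call on start. The ResourceManagerConnection is created by reconnectToResourceManager, which has three call sites. One of them is ResourceManagerLeaderListener, which is presumably the one that fires at startup.
The JobMaster registers this listener when it starts, in startJobMasterServices.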

resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());