Table of Contents
- 1.Usage
- 2.Analysis
- 2.1.Configurable resources
- 2.2.ResourceProfile
- 2.3.Conversion chain
- 2.3.1.StreamGraphGenerator
- 2.3.2.StreamGraph
- 2.3.3.JobVertex
- 2.3.4.ExecutionJobVertex
- 3.Processing
- 3.1.Implementation class
- 3.2.processResourceRequirements
- 3.3.checkResourceRequirementsWithDelay
- 3.4.checkResourceRequirements
- 3.4.1.Computing the missing resources
- 3.4.2.Resource matching
- 3.4.3.tryFulfilledRequirementWithResource
- 3.4.4.tryFulfillRequirementsForJobWithPendingResources
- 3.4.5.resultBuilder.build()
- 3.5.allocateSlotsAccordingTo
- 3.6.allocateTaskManagersAccordingTo
- 3.7.Failure handling
- 4.Trigger flow
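1.Usage
Class: flink-core\src\main\java\org\apache\flink\api\common\operators\SlotSharingGroup.java
Typical usage in application code is shown below; note the call to registerSlotSharingGroup.
Fine-grained resource management must also be switched on with cluster.fine-grained-resource-management.enabled.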
final SlotSharingGroup ssg =
SlotSharingGroup.newBuilder("ssg1").setCpuCores(1).setTaskHeapMemoryMB(100).build();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.registerSlotSharingGroup(ssg);
final DataStream<Integer> source = env.fromElements(1).slotSharingGroup("ssg1");
...
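2.Analysis
2.1.Configurable resources
The configurable parameters are CPU and three memory settings (task heap, task off-heap, and managed memory), plus extended resources such as GPU.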
this.name = checkNotNull(name);
this.cpuCores = checkNotNull(cpuCores);
this.taskHeapMemory = checkNotNull(taskHeapMemory);
this.taskOffHeapMemory = checkNotNull(taskOffHeapMemory);
this.managedMemory = checkNotNull(managedMemory);
this.externalResources.putAll(checkNotNull(extendedResources));
2.2.ResourceProfile
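The client works with SlotSharingGroup, while the server side should see ResourceProfile. The example above calls registerSlotSharingGroup, whose body is shown below; it ultimately converts the group into a ResourceProfile.
MemorySize.ZERO is passed as the networkMemory argument. Does that mean network memory is simply left unset? Presumably so, since network memory should be shared within a TaskManager.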
@PublicEvolving
public StreamExecutionEnvironment registerSlotSharingGroup(SlotSharingGroup slotSharingGroup) {
final ResourceSpec resourceSpec =
SlotSharingGroupUtils.extractResourceSpec(slotSharingGroup);
if (!resourceSpec.equals(ResourceSpec.UNKNOWN)) {
this.slotSharingGroupResources.put(
slotSharingGroup.getName(),
ResourceProfile.fromResourceSpec(
SlotSharingGroupUtils.extractResourceSpec(slotSharingGroup),
MemorySize.ZERO));
}
return this;
}
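The registration step above can be illustrated with a minimal, Flink-free sketch. Everything in it is a hypothetical stand-in rather than Flink's API: Profile is a reduced ResourceProfile, null arguments model a group whose spec is UNKNOWN, and the map plays the role of slotSharingGroupResources. It only shows the two behaviours visible in the excerpt: a group without a resource spec is not recorded, and network memory is always set to zero.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified stand-in for the registration step; none of these types are Flink classes. */
final class SlotSharingGroupRegistrySketch {

    /** Hypothetical reduced profile: CPU cores, task heap (MB) and network memory (MB). */
    static final class Profile {
        final double cpuCores;
        final int taskHeapMb;
        final int networkMb;

        Profile(double cpuCores, int taskHeapMb, int networkMb) {
            this.cpuCores = cpuCores;
            this.taskHeapMb = taskHeapMb;
            this.networkMb = networkMb;
        }
    }

    // Group name -> profile, playing the role of slotSharingGroupResources.
    private final Map<String, Profile> resources = new HashMap<>();

    /** Registers a group; null values model a group without a resource spec (UNKNOWN). */
    SlotSharingGroupRegistrySketch register(String name, Double cpuCores, Integer taskHeapMb) {
        if (cpuCores != null && taskHeapMb != null) {
            // Network memory is fixed to zero, like MemorySize.ZERO in the excerpt above.
            resources.put(name, new Profile(cpuCores, taskHeapMb, 0));
        }
        return this;
    }

    Optional<Profile> lookup(String name) {
        return Optional.ofNullable(resources.get(name));
    }

    public static void main(String[] args) {
        SlotSharingGroupRegistrySketch env =
                new SlotSharingGroupRegistrySketch()
                        .register("ssg1", 1.0, 100)
                        .register("noSpec", null, null);
        System.out.println("ssg1 registered: " + env.lookup("ssg1").isPresent());
        System.out.println("noSpec registered: " + env.lookup("noSpec").isPresent());
    }
}
```

Returning this from register mirrors the fluent style of the original method, so several groups can be registered in one chain.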
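2.3.Conversion chain
registerSlotSharingGroup records the resources in the Environment. From there they are handed along through several stages until they reach the server side.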
2.3.1.StreamGraphGenerator
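Application code first produces a StreamGraph, which is built by StreamGraphGenerator. The resource information is passed in when the StreamGraphGenerator is created.
slotSharingGroupResources is the map (group name to ResourceProfile) filled by registerSlotSharingGroup.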
private StreamGraphGenerator getStreamGraphGenerator(List<Transformation<?>> transformations) {
if (transformations.size() <= 0) {
throw new IllegalStateException(
"No operators defined in streaming topology. Cannot execute.");
}
return new StreamGraphGenerator(transformations, config, checkpointCfg, configuration)
.setStateBackend(defaultStateBackend)
.setChangelogStateBackendEnabled(changelogStateBackendEnabled)
.setSavepointDir(defaultSavepointDirectory)
.setChaining(isChainingEnabled)
.setUserArtifacts(cacheFile)
.setTimeCharacteristic(timeCharacteristic)
.setDefaultBufferTimeout(bufferTimeout)
.setSlotSharingGroupResource(slotSharingGroupResources);
}
2.3.2.StreamGraph
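StreamGraph is the first of Flink's three graphs. It is normally generated on the client and, after translation into a JobGraph, handed to the server, where it forms the basis of the job. The resource information is passed to the StreamGraph inside StreamGraphGenerator.generate.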
streamGraph.setSlotSharingGroupResource(slotSharingGroupResources);
2.3.3.JobVertex
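JobVertex is a JobGraph concept. It represents the aggregate formed by chaining several StreamNodes; its input is JobEdge and its output is IntermediateDataSet. It is one node of the JobGraph and is eventually expanded into Tasks according to its parallelism, so one JobVertex roughly corresponds to one group of Tasks.
The resource information is passed to the JobVertex in StreamingJobGraphGenerator.setSlotSharing, which is called from createJobGraph.
DEFAULT_SLOT_SHARING_GROUP is the default setting, under which resources are split evenly among slots.
SlotSharingGroup controls slot sharing: tasks in the same SlotSharingGroup can run in the same slot.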
if (slotSharingGroupKey.equals(StreamGraphGenerator.DEFAULT_SLOT_SHARING_GROUP)) {
// fallback to the region slot sharing group by default
effectiveSlotSharingGroup =
checkNotNull(vertexRegionSlotSharingGroups.get(vertex.getID()));
} else {
effectiveSlotSharingGroup =
specifiedSlotSharingGroups.computeIfAbsent(
slotSharingGroupKey,
k -> {
SlotSharingGroup ssg = new SlotSharingGroup();
streamGraph
.getSlotSharingGroupResource(k)
.ifPresent(ssg::setResourceProfile);
return ssg;
});
}
vertex.setSlotSharingGroup(effectiveSlotSharingGroup);
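The lookup logic above can be sketched without Flink as follows. The types and the "default" key are assumptions made for illustration: Group is a hypothetical stand-in for the runtime SlotSharingGroup, the profile is reduced to a string, and a single regionGroup replaces the per-region default groups of the original. The point is that computeIfAbsent creates one group object per user-specified key, so every vertex using that key shares the same instance, and the registered profile is attached only if one exists.

```java
import java.util.HashMap;
import java.util.Map;

final class SlotSharingResolutionSketch {

    // Assumed key of the default group, standing in for DEFAULT_SLOT_SHARING_GROUP.
    static final String DEFAULT_GROUP = "default";

    /** Hypothetical group object; resourceProfile stays null when nothing was registered. */
    static final class Group {
        String resourceProfile;
    }

    private final Map<String, String> registeredResources = new HashMap<>();
    private final Map<String, Group> specifiedGroups = new HashMap<>();
    // Single stand-in for the per-region default groups of the original.
    private final Group regionGroup = new Group();

    void registerResource(String groupKey, String profile) {
        registeredResources.put(groupKey, profile);
    }

    Group resolve(String groupKey) {
        if (groupKey.equals(DEFAULT_GROUP)) {
            // Default key: fall back to the region group, which carries no explicit profile.
            return regionGroup;
        }
        return specifiedGroups.computeIfAbsent(
                groupKey,
                k -> {
                    Group group = new Group();
                    group.resourceProfile = registeredResources.get(k); // null if unregistered
                    return group;
                });
    }

    public static void main(String[] args) {
        SlotSharingResolutionSketch sketch = new SlotSharingResolutionSketch();
        sketch.registerResource("ssg1", "cpu=1,heap=100MB");
        Group first = sketch.resolve("ssg1");
        Group second = sketch.resolve("ssg1");
        System.out.println("same instance: " + (first == second));
        System.out.println("profile: " + first.resourceProfile);
        System.out.println("default profile: " + sketch.resolve(DEFAULT_GROUP).resourceProfile);
    }
}
```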
2.3.4.ExecutionJobVertex
ExecutionJobVertex is the ExecutionGraph counterpart of JobVertex, and the SlotSharingGroup is carried over to it accordingly.
// take the sharing group
this.slotSharingGroup = checkNotNull(jobVertex.getSlotSharingGroup());
this.coLocationGroup = jobVertex.getCoLocationGroup();
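3.Processing
One open question: resource allocation always appears to be triggered from the job side, so how does this differ from declarative resource management?
According to the description of declarative resource management, the old mechanism allocated per task, and the declarative one improved on that by allocating for the job as a whole. Fine-grained allocation appears to achieve the same improvement (it seems to allocate per group).
3.1.Implementation class
Setting cluster.fine-grained-resource-management.enabled to true turns on fine-grained resource allocation.
Fine-grained resource allocation is implemented by FineGrainedSlotManager.
Do declarative management and fine-grained management not share an implementation? If so, fine-grained management would be incompatible with a custom scheduler.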
if (configuration.isEnableFineGrainedResourceManagement()) {
return new FineGrainedSlotManager(
scheduledExecutor,
slotManagerConfiguration,
slotManagerMetricGroup,
new DefaultResourceTracker(),
new FineGrainedTaskManagerTracker(),
new DefaultSlotStatusSyncer(
slotManagerConfiguration.getTaskManagerRequestTimeout()),
new DefaultResourceAllocationStrategy(
SlotManagerUtils.generateTaskManagerTotalResourceProfile(
slotManagerConfiguration.getDefaultWorkerResourceSpec()),
slotManagerConfiguration.getNumSlotsPerWorker()),
Time.milliseconds(REQUIREMENTS_CHECK_DELAY_MS));
} else {
return new DeclarativeSlotManager(
scheduledExecutor,
slotManagerConfiguration,
slotManagerMetricGroup,
new DefaultResourceTracker(),
new DefaultSlotTracker());
}
3.2.processResourceRequirements
The entry point should be processResourceRequirements of the implementation class. Its main steps are shown below.
Here the targetAddress of resourceRequirements is the address of the JobMaster.
if (resourceRequirements.getResourceRequirements().isEmpty()) {
jobMasterTargetAddresses.remove(resourceRequirements.getJobId());
} else {
jobMasterTargetAddresses.put(
resourceRequirements.getJobId(), resourceRequirements.getTargetAddress());
}
resourceTracker.notifyResourceRequirements(
resourceRequirements.getJobId(), resourceRequirements.getResourceRequirements());
checkResourceRequirementsWithDelay();
3.3.checkResourceRequirementsWithDelay
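To reduce load and improve performance, the requirement check is delayed so that a single check can cover more changes. The delay is currently fixed at 50 milliseconds and cannot be configured.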
private void checkResourceRequirementsWithDelay() {
if (requirementsCheckFuture == null || requirementsCheckFuture.isDone()) {
requirementsCheckFuture = new CompletableFuture<>();
scheduledExecutor.schedule(
() ->
mainThreadExecutor.execute(
() -> {
checkResourceRequirements();
Preconditions.checkNotNull(requirementsCheckFuture)
.complete(null);
}),
requirementsCheckDelay.toMilliseconds(),
TimeUnit.MILLISECONDS);
}
}
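The effect of this guard is that any number of calls arriving before the timer fires result in a single check. The sketch below makes that observable in a deterministic way. It is an illustration under assumptions, not Flink code: a manual queue with a hypothetical fireTimers method replaces the scheduledExecutor and the main-thread executor, and the check itself is reduced to a counter.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

final class DelayedCheckSketch {

    // Manual stand-in for the scheduled executor: timers fire only when fireTimers() is called.
    private final Queue<Runnable> scheduled = new ArrayDeque<>();
    private CompletableFuture<Void> requirementsCheckFuture;
    int checksRun = 0;

    void checkResourceRequirementsWithDelay() {
        // Schedule a new check only if none is outstanding.
        if (requirementsCheckFuture == null || requirementsCheckFuture.isDone()) {
            requirementsCheckFuture = new CompletableFuture<>();
            scheduled.add(
                    () -> {
                        checksRun++; // stands in for checkResourceRequirements()
                        requirementsCheckFuture.complete(null);
                    });
        }
    }

    /** Hypothetical helper: simulates the delay elapsing. */
    void fireTimers() {
        Runnable timer;
        while ((timer = scheduled.poll()) != null) {
            timer.run();
        }
    }

    int pendingTimers() {
        return scheduled.size();
    }

    public static void main(String[] args) {
        DelayedCheckSketch manager = new DelayedCheckSketch();
        manager.checkResourceRequirementsWithDelay();
        manager.checkResourceRequirementsWithDelay();
        manager.checkResourceRequirementsWithDelay();
        System.out.println("timers scheduled: " + manager.pendingTimers());
        manager.fireTimers();
        System.out.println("checks run: " + manager.checksRun);
    }
}
```

Once the future completes, the next call schedules a fresh timer, so later changes are never lost; they are only batched.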
3.4.checkResourceRequirements
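This is the method that actually checks the requirements and drives the allocation.
3.4.1.Computing the missing resources
processResourceRequirements records the resource requests, and later steps of this method record what has already been allocated. This step combines the two to work out which resources the job is still missing.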
public Collection<ResourceRequirement> getMissingResources() {
final Collection<ResourceRequirement> missingResources = new ArrayList<>();
for (Map.Entry<ResourceProfile, Integer> requirement :
resourceRequirements.getResourcesWithCount()) {
ResourceProfile requirementProfile = requirement.getKey();
int numRequiredResources = requirement.getValue();
int numAcquiredResources =
resourceToRequirementMapping.getNumFulfillingResources(requirementProfile);
if (numAcquiredResources < numRequiredResources) {
missingResources.add(
ResourceRequirement.create(
requirementProfile, numRequiredResources - numAcquiredResources));
}
}
return missingResources;
}
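A small worked example may make the subtraction concrete. The sketch below is a simplification under assumptions: profiles are reduced to string names, and two plain maps replace resourceRequirements and resourceToRequirementMapping. With three "small" and one "large" slot required, and one of each already acquired, only two "small" slots are reported as missing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MissingResourcesSketch {

    /** Returns, per profile, how many slots are still missing; satisfied profiles are omitted. */
    static Map<String, Integer> missing(
            Map<String, Integer> required, Map<String, Integer> acquired) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> requirement : required.entrySet()) {
            int numRequired = requirement.getValue();
            int numAcquired = acquired.getOrDefault(requirement.getKey(), 0);
            if (numAcquired < numRequired) {
                result.put(requirement.getKey(), numRequired - numAcquired);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> required = new LinkedHashMap<>();
        required.put("small", 3);
        required.put("large", 1);
        Map<String, Integer> acquired = new LinkedHashMap<>();
        acquired.put("small", 1);
        acquired.put("large", 1);
        System.out.println(missing(required, acquired)); // prints {small=2}
    }
}
```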
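3.4.2.Resource matching
This step matches requirements against resources using the recorded resource information, without performing any real allocation. It is handled by tryFulfillRequirements of ResourceAllocationStrategy.
Two kinds of resources are defined here: registeredResources and pendingResources. registeredResources should correspond to all registered TaskManagers. The entries behind pendingResources are first created later in the flow, in tryFulfillRequirementsForJobWithPendingResources. The method is as follows.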
public ResourceAllocationResult tryFulfillRequirements(
Map<JobID, Collection<ResourceRequirement>> missingResources,
TaskManagerResourceInfoProvider taskManagerResourceInfoProvider) {
final ResourceAllocationResult.Builder resultBuilder = ResourceAllocationResult.builder();
final List<InternalResourceInfo> registeredResources =
getRegisteredResources(taskManagerResourceInfoProvider, resultBuilder);
final List<InternalResourceInfo> pendingResources =
getPendingResources(taskManagerResourceInfoProvider, resultBuilder);
for (Map.Entry<JobID, Collection<ResourceRequirement>> resourceRequirements :
missingResources.entrySet()) {
final JobID jobId = resourceRequirements.getKey();
final Collection<ResourceRequirement> unfulfilledJobRequirements =
tryFulfillRequirementsForJobWithResources(
jobId, resourceRequirements.getValue(), registeredResources);
if (!unfulfilledJobRequirements.isEmpty()) {
tryFulfillRequirementsForJobWithPendingResources(
jobId, unfulfilledJobRequirements, pendingResources, resultBuilder);
}
}
return resultBuilder.build();
}
3.4.3.tryFulfilledRequirementWithResource
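This is where tryFulfillRequirementsForJobWithResources ends up during matching, and it is where slots are assigned in the in-memory bookkeeping (the real slot requests are sent later, in 3.5). Assignment goes TaskManager by TaskManager: one TaskManager is filled first, and the next one is used only when it runs out.
The returned numUnfulfilled is the number of requirements this pass could not satisfy.
Note that this method is called once per requiredResource, in a loop.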
private static int tryFulfilledRequirementWithResource(
List<InternalResourceInfo> internalResource,
int numUnfulfilled,
ResourceProfile requiredResource,
JobID jobId) {
final Iterator<InternalResourceInfo> internalResourceInfoItr = internalResource.iterator();
while (numUnfulfilled > 0 && internalResourceInfoItr.hasNext()) {
final InternalResourceInfo currentTaskManager = internalResourceInfoItr.next();
while (numUnfulfilled > 0
&& currentTaskManager.tryAllocateSlotForJob(jobId, requiredResource)) {
numUnfulfilled--;
}
if (currentTaskManager.availableProfile.equals(ResourceProfile.ZERO)) {
internalResourceInfoItr.remove();
}
}
return numUnfulfilled;
}
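The fill-one-TaskManager-first behaviour is easy to see with numbers. The following sketch keeps the loop structure of the method above but simplifies everything else as an assumption: a resource profile is reduced to an integer number of CPU cores, and TaskManager is a hypothetical stand-in for InternalResourceInfo. With TaskManagers of 4 and 3 cores and five requested slots of 2 cores each, the first TaskManager takes two slots and is removed as exhausted, the second takes one, and two slots remain unfulfilled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class GreedyFillSketch {

    /** Hypothetical stand-in for InternalResourceInfo, tracking CPU cores only. */
    static final class TaskManager {
        final String id;
        int availableCpu;
        int allocatedSlots;

        TaskManager(String id, int availableCpu) {
            this.id = id;
            this.availableCpu = availableCpu;
        }

        boolean tryAllocateSlot(int cpuPerSlot) {
            if (availableCpu < cpuPerSlot) {
                return false;
            }
            availableCpu -= cpuPerSlot;
            allocatedSlots++;
            return true;
        }
    }

    /** Fills TaskManagers in list order and returns how many slots could not be placed. */
    static int tryFulfil(List<TaskManager> taskManagers, int numUnfulfilled, int cpuPerSlot) {
        Iterator<TaskManager> iterator = taskManagers.iterator();
        while (numUnfulfilled > 0 && iterator.hasNext()) {
            TaskManager current = iterator.next();
            while (numUnfulfilled > 0 && current.tryAllocateSlot(cpuPerSlot)) {
                numUnfulfilled--;
            }
            if (current.availableCpu == 0) {
                iterator.remove(); // exhausted, skip it for later requirements
            }
        }
        return numUnfulfilled;
    }

    public static void main(String[] args) {
        TaskManager tm1 = new TaskManager("tm1", 4);
        TaskManager tm2 = new TaskManager("tm2", 3);
        List<TaskManager> taskManagers = new ArrayList<>(Arrays.asList(tm1, tm2));
        int unfulfilled = tryFulfil(taskManagers, 5, 2);
        System.out.println("unfulfilled: " + unfulfilled);
        System.out.println("tm1 slots: " + tm1.allocatedSlots + ", tm2 slots: " + tm2.allocatedSlots);
    }
}
```

The list has to be mutable because exhausted entries are removed through the iterator, which is also why the original passes a List rather than an immutable view.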
3.4.4.tryFulfillRequirementsForJobWithPendingResources
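This makes a second pass over the requirements that the previous section left unfulfilled. The approach is the same, except that the resources come from PendingTaskManagers and there is an extra follow-up step, which creates the PendingTaskManager mentioned earlier.
PendingTaskManager is designed to support dynamic allocation of pending task executors (the official wording, paraphrased).
Creating a PendingTaskManager depends on totalResourceProfile, which comes from createDefaultWorkerResourceSpec of WorkerResourceSpecFactory. Different resource managers implement it differently. The standalone (SA) implementation returns ZERO, which means PendingTaskManager is never used in that mode.
The ArbitraryWorkerResourceSpecFactory implementation is as follows.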
public WorkerResourceSpec createDefaultWorkerResourceSpec(Configuration configuration) {
return WorkerResourceSpec.ZERO;
}
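The YarnWorkerResourceSpecFactory implementation relies on the method below, which builds a resource spec from the configured memory and CPU. Read this way, PendingTaskManager is a virtual concept for resource managers that can request resources dynamically, so that scale-out can proceed quickly when the real allocation happens.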
protected WorkerResourceSpec workerResourceSpecFromConfigAndCpu(
Configuration configuration, CPUResource cpuResource) {
final TaskExecutorProcessSpec taskExecutorProcessSpec =
TaskExecutorProcessUtils.newProcessSpecBuilder(configuration)
.withCpuCores(cpuResource)
.build();
return WorkerResourceSpec.fromTaskExecutorProcessSpec(taskExecutorProcessSpec);
}
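Note that a whole TaskManager's worth of resources, as configured, is requested here, and matching then runs against that amount. If anything is left over after matching, the remainder is added to the list and returned. Everything ultimately ends up in resultBuilder, which is used by the caller.
Note that there are two kinds of additions: addPendingTaskManagerAllocate and addAllocationOnPendingResource.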
while (numUnfulfilled > 0
&& canFulfillRequirement(effectiveProfile, remainResource)) {
numUnfulfilled--;
resultBuilder.addAllocationOnPendingResource(
jobId,
newPendingTaskManager.getPendingTaskManagerId(),
effectiveProfile);
remainResource = remainResource.subtract(effectiveProfile);
}
if (!remainResource.equals(ResourceProfile.ZERO)) {
availableResources.add(
new InternalResourceInfo(
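The pending-resource pass described in this section can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Flink implementation: resources are reduced to integer CPU cores, Result is a hypothetical stand-in for what resultBuilder collects, and the sketch always starts from new pending TaskManagers instead of first trying leftovers from earlier ones. It shows the whole-TaskManager granularity: five slots of 2 cores on 4-core TaskManagers need three pending TaskManagers, and the last one keeps 2 cores spare.

```java
import java.util.ArrayList;
import java.util.List;

final class PendingFillSketch {

    /** Hypothetical stand-in for the information gathered in resultBuilder. */
    static final class Result {
        int pendingTaskManagers; // how many new TaskManagers would be requested
        int unfulfillable; // slots that can never fit into a single TaskManager
        final List<Integer> leftoverCpu = new ArrayList<>(); // spare capacity per pending TM
    }

    static Result fulfilWithPending(int numUnfulfilled, int cpuPerSlot, int cpuPerTaskManager) {
        Result result = new Result();
        if (cpuPerSlot > cpuPerTaskManager) {
            // A slot larger than a whole TaskManager cannot be satisfied by scaling out.
            // This also covers a zero-sized worker spec, as in standalone mode.
            result.unfulfillable = numUnfulfilled;
            return result;
        }
        while (numUnfulfilled > 0) {
            result.pendingTaskManagers++;
            int remain = cpuPerTaskManager;
            while (numUnfulfilled > 0 && remain >= cpuPerSlot) {
                numUnfulfilled--;
                remain -= cpuPerSlot;
            }
            if (remain != 0) {
                result.leftoverCpu.add(remain); // still usable by later requirements
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = fulfilWithPending(5, 2, 4);
        System.out.println("pending TaskManagers: " + result.pendingTaskManagers);
        System.out.println("leftover CPU: " + result.leftoverCpu);
    }
}
```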
3.4.5.resultBuilder.build()
This step simply packages the allocation results described above into an object for the next step.
public ResourceAllocationResult build() {
return new ResourceAllocationResult(
unfulfillableJobs,
allocationsOnRegisteredResources,
pendingTaskManagersToAllocate,
allocationsOnPendingResources);
}
3.5.allocateSlotsAccordingTo
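This step performs the real allocation according to the result of the previous step; the final allocation happens in allocateSlot of DefaultSlotStatusSyncer. It covers only TaskManagers that already exist, so nothing planned on PendingTaskManagers is included.
The process looks up the gateway of the TaskManager and calls gateway.requestSlot to request the slot. After that, the TaskExecutor allocates the resources and creates the job information.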
final Optional<TaskManagerInfo> taskManager =
taskManagerTracker.getRegisteredTaskManager(instanceId);
final TaskExecutorGateway gateway =
taskManager.get().getTaskExecutorConnection().getTaskExecutorGateway();
CompletableFuture<Acknowledge> requestFuture =
gateway.requestSlot(
3.6.allocateTaskManagersAccordingTo
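This step handles allocation for the PendingTaskManagers.
It first checks whether the total resources have reached the upper limit. The limit covers CPU and memory and is controlled by two options: slotmanager.max-total-resource.cpu and slotmanager.max-total-resource.memory.
If they are not set, the limit is derived from slotmanager.number-of-slots.max. This derivation is questionable: the configured memory and other resources apply to a single process, so does dividing them on the basis of slotmanager.number-of-slots.max change their meaning?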
if (isMaxTotalResourceExceededAfterAdding(pendingTaskManager.getTotalResourceProfile())) {
LOG.info(
"Could not allocate {}. Max total resource limitation <{}, {}> is reached.",
pendingTaskManager,
maxTotalCpu,
maxTotalMem.toHumanReadableString());
return false;
}
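The limit check can be pictured with a small sketch. It is an assumption-laden simplification, not the Flink code: resources are integer CPU cores and memory in MB, and currentCpu and currentMemMb are hypothetical counters for everything already registered or pending. A new TaskManager is rejected as soon as adding it would push either dimension past its maximum.

```java
final class MaxTotalResourceSketch {

    private final int maxTotalCpu;
    private final int maxTotalMemMb;
    // Hypothetical running totals over registered and pending TaskManagers.
    private int currentCpu;
    private int currentMemMb;

    MaxTotalResourceSketch(int maxTotalCpu, int maxTotalMemMb) {
        this.maxTotalCpu = maxTotalCpu;
        this.maxTotalMemMb = maxTotalMemMb;
    }

    boolean isMaxTotalResourceExceededAfterAdding(int cpu, int memMb) {
        return currentCpu + cpu > maxTotalCpu || currentMemMb + memMb > maxTotalMemMb;
    }

    /** Returns false, without changing the totals, if the limit would be exceeded. */
    boolean tryAddPendingTaskManager(int cpu, int memMb) {
        if (isMaxTotalResourceExceededAfterAdding(cpu, memMb)) {
            return false;
        }
        currentCpu += cpu;
        currentMemMb += memMb;
        return true;
    }

    public static void main(String[] args) {
        MaxTotalResourceSketch limits = new MaxTotalResourceSketch(8, 16384);
        System.out.println(limits.tryAddPendingTaskManager(4, 8192));
        System.out.println(limits.tryAddPendingTaskManager(4, 8192));
        System.out.println(limits.tryAddPendingTaskManager(4, 8192));
    }
}
```

In the example the first two TaskManagers exactly reach the limit of 8 cores and 16384 MB, so the third request is the one that gets refused.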
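Requesting new resources. Only resource managers that can scale TaskManagers dynamically, such as Yarn and K8S, support this. Once the check passes, a new TaskManager is requested in requestNewWorker of ActiveResourceManager.
After the request is made, the PendingTaskManager is added to the list held by taskManagerTracker, which is where the earlier matching step reads it from.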
taskManagerTracker.addPendingTaskManager(pendingTaskManager);
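3.7.Failure handling
Some failure handling follows, which mainly updates these lists.
4.Trigger flow
The question here is which action triggers resource allocation. The allocation flow is initiated by the JobMaster: tracing the call chain of processResourceRequirements leads back to establishResourceManagerConnection in JobMaster.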
slotPoolService.connectToResourceManager(resourceManagerGateway);
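establishResourceManagerConnection is invoked as a result of the registration that a RegisteredRpcConnection triggers when it starts. ResourceManagerConnection extends RegisteredRpcConnection, so in practice a ResourceManagerConnection is created and the call fires once it is started. The ResourceManagerConnection is created by reconnectToResourceManager, which has three trigger points. One of them is ResourceManagerLeaderListener, and the initial start should come through it.
The JobMaster registers this listener when it starts, in startJobMasterServices.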
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());