Spark Shuffle

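Original post by hyunbar777, 2021-08-02 14:04:42

© Copyright belongs to the author: this is an original work by 51CTO blog author hyunbar777. Contact the author for permission before reposting; otherwise legal liability will be pursued.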

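In every MapReduce framework, Shuffle is the bridge between map tasks and reduce tasks.
The intermediate output of a map task has to pass through Shuffle before it becomes the input of a reduce task, so Shuffle performance directly determines the performance and throughput of the whole compute engine.
Shuffle is an execution phase that every MapReduce framework has to deal with: it connects the output of the map tasks to the input of the reduce tasks.
The intermediate output of each map task is assigned, according to a partitioning strategy (for example, the hash of the key), to the reduce task that handles the corresponding partition.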
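The partitioning strategy mentioned above can be illustrated with a small sketch. It is plain Scala with no Spark dependency; the object and method names are made up for this illustration, and sending null keys to partition 0 is an assumption of the sketch rather than something stated in this article.

```scala
object HashPartitioningSketch {
  // Modulo that never returns a negative value, since hashCode may be negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Pick the reduce partition for a key.
  // Assumption of this sketch: null keys go to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  // Group map output records by the reduce partition they belong to.
  def route[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionFor(k, numPartitions) }
}
```

With 3 reduce partitions, the keys 1 and 4 land in partition 1 and the key 2 lands in partition 2, so each reduce task only needs to fetch the group that carries its own partition number.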

A generic MapReduce framework:

[Figure: generic MapReduce framework]

1 Shuffle core


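When a job is divided into stages:
The last stage is called finalStage (the variable name); it is an object of type ResultStage.
All preceding stages are ShuffleMapStages.
The end of a ShuffleMapStage is accompanied by writing the shuffle files to disk.
A ResultStage corresponds to an action operator, that is, applying a function to the data of each partition of the RDD; its completion marks the end of a job.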

1.1 CoarseGrainedExecutorBackend

Launching a task:

override def receive: PartialFunction[Any, Unit] = {
    ...
    case LaunchTask(data) =>
        if (executor == null) {
            ...
        } else {
            val taskDesc = ser.deserialize[TaskDescription](data.value)
            // Launch the task
            executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
                taskDesc.name, taskDesc.serializedTask)
        }
}

executor.launchTask：

def launchTask(
        context: ExecutorBackend,
        taskId: Long,
        attemptNumber: Int,
        taskName: String,
        serializedTask: ByteBuffer): Unit = {
    // TaskRunner implements the Runnable interface.
    val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
        serializedTask)
    runningTasks.put(taskId, tr)
    // Run the task on the thread pool
    threadPool.execute(tr)
}
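The pattern used by launchTask (wrap the work in a Runnable, register it in a map of running tasks, hand it to a thread pool) can be tried out in isolation. The following is only a rough sketch built on java.util.concurrent; SimpleTaskRunner, launchTask and shutdownAndWait are hypothetical names for this illustration and are not Spark classes.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

object LaunchTaskSketch {
  // Hypothetical stand-in for TaskRunner: a Runnable that records its result.
  final class SimpleTaskRunner(val taskId: Long, body: () => Int) extends Runnable {
    @volatile var result: Option[Int] = None
    override def run(): Unit = { result = Some(body()) }
  }

  // Tasks that have been submitted, keyed by task id.
  private val runningTasks = new ConcurrentHashMap[Long, SimpleTaskRunner]()
  private val threadPool = Executors.newCachedThreadPool()

  // Register the runner, then execute it on a pool thread.
  def launchTask(taskId: Long)(body: => Int): SimpleTaskRunner = {
    val tr = new SimpleTaskRunner(taskId, () => body)
    runningTasks.put(taskId, tr)
    threadPool.execute(tr)
    tr
  }

  // Stop accepting new tasks and wait for the submitted ones to finish.
  def shutdownAndWait(): Unit = {
    threadPool.shutdown()
    threadPool.awaitTermination(10, TimeUnit.SECONDS)
  }
}
```

The caller returns immediately after submitting; the task body runs later on a pool thread, which is why the real executor reports progress through status updates instead of a return value.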

The tr.run method

override def run(): Unit = {
    // Update the task state
    execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
    try {
        // Deserialize the data that belongs to the task
        val (taskFiles, taskJars, taskProps, taskBytes) =
            Task.deserializeWithDependencies(serializedTask)
        ...
        val value = try {
            // Start running the Task
            val res = task.run(
                taskAttemptId = taskId,
                attemptNumber = attemptNumber,
                metricsSystem = env.metricsSystem)
            res
        } finally {
            ...
        }
    } catch {
        ...
    } finally {
        ...
    }
}

The Task.run method

final def run(
        taskAttemptId: Long,
        attemptNumber: Int,
        metricsSystem: MetricsSystem): T = {
    context = new TaskContextImpl(
        stageId,
        partitionId,
        taskAttemptId,
        attemptNumber,
        taskMemoryManager,
        localProperties,
        metricsSystem,
        metrics)
    try {
        // Run the task
        runTask(context)
    } catch {
        ...
    } finally {
        ...
    }
}

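Task.runTask is an abstract method.

Task has two implementation classes, ShuffleMapTask and ResultTask, which run the tasks of the two different kinds of stage.

[Figure: the two implementation classes of Task]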
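The relationship between the final run method and the abstract runTask is a template method. A simplified sketch of that shape follows; all class names here are hypothetical stand-ins, and the vector of record counts only loosely mimics what a real MapStatus carries.

```scala
object TaskHierarchySketch {
  // Hypothetical, simplified stand-in for Spark's Task[T].
  abstract class SimpleTask[T](val partitionId: Int) {
    // Final template method: shared bookkeeping, then the subclass-specific work.
    final def run(attemptNumber: Int): T = {
      require(attemptNumber >= 0, "attemptNumber must be non-negative")
      runTask()
    }
    protected def runTask(): T
  }

  // Map-side task: returns how many records went to each reducer
  // (a stand-in for the sizes reported in a MapStatus).
  class SimpleShuffleMapTask(pid: Int, records: Seq[(String, Int)], numReducers: Int)
      extends SimpleTask[Vector[Long]](pid) {
    protected def runTask(): Vector[Long] = {
      val counts = Array.fill(numReducers)(0L)
      records.foreach { case (k, _) =>
        counts(Math.floorMod(k.hashCode, numReducers)) += 1
      }
      counts.toVector
    }
  }

  // Final-stage task: applies a function to the partition data and returns its value.
  class SimpleResultTask[U](pid: Int, records: Seq[(String, Int)], func: Seq[(String, Int)] => U)
      extends SimpleTask[U](pid) {
    protected def runTask(): U = func(records)
  }
}
```

The caller only ever invokes run; which of the two behaviours it gets depends on the stage type that created the task.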

1.2 ShuffleMapTask source code analysis

The ShuffleMapTask.runTask method

override def runTask(context: TaskContext): MapStatus = {
    // Deserialize the RDD using the broadcast variable.
    val threadMXBean = ManagementFactory.getThreadMXBean
    val deserializeStartTime = System.currentTimeMillis()
    val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
        threadMXBean.getCurrentThreadCpuTime
    } else 0L
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
    _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
    _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
        threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
    } else 0L
    // Core code: the ShuffleWriter writes the data that needs to be shuffled
    var writer: ShuffleWriter[Any, Any] = null
    try {
        val manager: ShuffleManager = SparkEnv.get.shuffleManager
        // Obtain the ShuffleWriter object
        writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
        // Write out the data of the RDD; the tasks of the reduce stage read it later
        writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
        writer.stop(success = true).get
    } catch {
        case e: Exception =>
            try {
                if (writer != null) {
                    writer.stop(success = false)
                }
            } catch {
                case e: Exception =>
                    log.debug("Could not stop writer", e)
            }
            throw e
    }
}

ShuffleWriter is an abstract class with 3 implementations

[Figure: the three ShuffleWriter implementations]

1.3 ShuffleManager

ShuffleManager is a trait. Since 2.0.0 it has only one implementation class: SortShuffleManager

[Figure: ShuffleManager and its implementation SortShuffleManager]

The registerShuffle method: decides which ShuffleHandle to use

/**
 * Register a shuffle with the manager and obtain a handle for it to pass to tasks.
 */
override def registerShuffle[K, V, C](
        shuffleId: Int,
        numMaps: Int,
        dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(SparkEnv.get.conf, dependency)) {
        // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
        // need map-side aggregation, then write numPartitions files directly and just concatenate
        // them at the end. This avoids doing serialization and deserialization twice to merge
        // together the spilled files, which would happen with the normal code path. The downside is
        // having multiple files open at a time and thus more memory allocated to buffers.
        new BypassMergeSortShuffleHandle[K, V](
            shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
        // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
        new SerializedShuffleHandle[K, V](
            shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
        // Otherwise, buffer map outputs in a deserialized form:
        new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
}
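The three-way choice made by registerShuffle can be written as a pure function over a few properties of the dependency. This is a sketch under assumptions: DepInfo and HandleKind are hypothetical types, and the conditions used for the serialized path (a serializer that supports relocating serialized objects, no map-side combine, at most 2^24 partitions) are my reading of canUseSerializedShuffle, whose body is not shown in this article.

```scala
object ShuffleHandleSelectionSketch {
  sealed trait HandleKind
  case object BypassMergeSort extends HandleKind
  case object Serialized extends HandleKind
  case object Base extends HandleKind

  // Hypothetical summary of the ShuffleDependency properties that matter here.
  final case class DepInfo(
      mapSideCombine: Boolean,
      numPartitions: Int,
      serializerSupportsRelocation: Boolean)

  // Assumed upper bound on partitions for the serialized path (2^24).
  val MaxSerializedModePartitions: Int = 1 << 24

  def choose(dep: DepInfo, bypassMergeThreshold: Int = 200): HandleKind = {
    if (!dep.mapSideCombine && dep.numPartitions <= bypassMergeThreshold) {
      BypassMergeSort // few partitions and no map-side aggregation
    } else if (dep.serializerSupportsRelocation && !dep.mapSideCombine &&
        dep.numPartitions <= MaxSerializedModePartitions) {
      Serialized // buffer map output in serialized form
    } else {
      Base // buffer map output in deserialized form
    }
  }
}
```

The order of the checks matters: a dependency that qualifies for both the bypass path and the serialized path gets the bypass handle, because that branch is tested first.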

The getWriter method

/** Get a writer for a given partition. Called on executors by map tasks. */
override def getWriter[K, V](
        handle: ShuffleHandle,
        mapId: Int,
        context: TaskContext): ShuffleWriter[K, V] = {
    numMapsForShuffle.putIfAbsent(
        handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
    val env = SparkEnv.get
    // Create a different ShuffleWriter depending on the kind of handle
    handle match {
        case unsafeShuffleHandle: SerializedShuffleHandle[K@unchecked, V@unchecked] =>
            new UnsafeShuffleWriter(
                env.blockManager,
                shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
                context.taskMemoryManager(),
                unsafeShuffleHandle,
                mapId,
                context,
                env.conf)
        case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K@unchecked, V@unchecked] =>
            new BypassMergeSortShuffleWriter(
                env.blockManager,
                shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
                bypassMergeSortHandle,
                mapId,
                context,
                env.conf)
        case other: BaseShuffleHandle[K@unchecked, V@unchecked, _] =>
            new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
    }
}

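2 HashShuffle (background only)

In early Spark versions the default shuffle implementation was hash-based. Sort-based shuffle became the default in Spark 1.2, and the hash implementation remained available as an option throughout the 1.x line, including 1.6.

In Spark 2.0, HashShuffle was removed from the source code entirely.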

HashShuffle

[Figure: unoptimized HashShuffle with 4 mapper tasks and 3 reducers]

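To simplify the analysis, assume that each Executor has only 1 CPU core. In other words, no matter how many task threads are assigned to an Executor, it can run only one of them at a time.

The figure shows 3 Reducers. On the map side, each Task hashes its own records (the partitioner computes hash(key) modulo the number of reducers) and sorts them into 3 categories, so every Task ends up with 3 categories of data. Because the goal is to bring records of the same category together and compute the final result, each Reducer collects the data of its own category from every Task and merges it into one large collection of that category. Each Task writes 3 local files, and there are 4 Mapper Tasks, so the total number of output files is 4 x 3 = 12.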

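Drawbacks:

The intermediate results of a map task are first stored in memory (a buffer) and only then written to disk. This is expensive in memory: when the output of the map tasks on one node is large, memory quickly becomes tight and an OOM can occur.
A large number of small files is produced. With M MapTasks and N ReduceTasks, M * N small files are created, and disk I/O becomes the performance bottleneck.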
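The M * N growth is easy to check with a few lines of Scala. This is only an illustrative calculation; the file names produced below are invented for the example and are not the names Spark uses.

```scala
object HashShuffleFileCountSketch {
  // Every map task writes one file per reduce task.
  def unoptimizedFiles(mapTasks: Int, reduceTasks: Int): Long =
    mapTasks.toLong * reduceTasks

  // Hypothetical file names, only to make the M x N layout visible.
  def fileNames(mapTasks: Int, reduceTasks: Int): Seq[String] =
    for (m <- 0 until mapTasks; r <- 0 until reduceTasks)
      yield s"shuffle_map${m}_reduce${r}"
}
```

For the 4 mappers and 3 reducers of the figure this gives 12 files; with 1000 map tasks and 1000 reduce tasks it is already one million.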

Optimized HashShuffle

[Figure: optimized HashShuffle with file consolidation]

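The optimized HashShuffle enables a consolidation mechanism, which means that buffers are reused. It is switched on with the setting spark.shuffle.consolidateFiles. The default value is false; setting it to true enables the optimization. As a rule, if you use HashShuffleManager, enabling this option is recommended.

There are still 4 Tasks and still 3 categories of data. Because the hash algorithm classifies records by key, all Tasks that run in the same process put records of the same category into the same buffer, no matter how many Tasks there are, and the buffers are written to local files whose number is counted per core (each file holds the data of only one category of keys). The Tasks of one process therefore all write into the 3 local files shared by that process. In this example the 4 Mapper Tasks run on 2 cores, so the total output is 2 cores x 3 category files = 6 local files.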
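The effect of consolidation on the file count can be compared with the unoptimized case in a short calculation. The function names are hypothetical; the only inputs are the numbers used in the example above.

```scala
object ConsolidatedHashShuffleSketch {
  // Without consolidation: one file per (map task, reduce task) pair.
  def unoptimizedFiles(mapTasks: Int, reduceTasks: Int): Long =
    mapTasks.toLong * reduceTasks

  // With consolidation: one file per (core, reduce task) pair,
  // because tasks that run one after another on the same core reuse the files.
  def consolidatedFiles(totalCores: Int, reduceTasks: Int): Long =
    totalCores.toLong * reduceTasks
}
```

The number of files now depends on the number of cores rather than on the number of map tasks, so adding more map tasks on the same cores does not create more files.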

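3 SortShuffle

Regular SortShuffle

In this mode, records are first written into an in-memory data structure. For reduceByKey the structure is a Map: records are aggregated locally through the Map while they are written to memory.

For a join operator the structure is an ArrayList, and records are written to memory directly.

After each write, a check determines whether a threshold has been reached. If it has, the contents of the in-memory data structure are written to disk and the structure is cleared.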

[Figure: regular SortShuffle, spilling to temporary files and merging]

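Before a spill to disk, the data is sorted by key, and the sorted data is written to the disk file in batches. The default batch size is 10000 records, so the data is written 10000 records at a time. Writes go through a buffer, and every spill produces one disk file, which means that a single Task can produce several temporary files.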
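The buffer, threshold check, sort and spill cycle described above can be sketched with in-memory collections. This is a deliberately simplified model: SpillingCombiner is a hypothetical class, the threshold counts distinct keys instead of measuring memory, "files" are plain sorted vectors, and values are combined by summing as reduceByKey(_ + _) would.

```scala
import scala.collection.mutable

object SpillingSorterSketch {
  // Hypothetical stand-in for a sorter that spills; each spill is a sorted vector.
  final class SpillingCombiner(maxInMemoryKeys: Int) {
    private val buffer = mutable.HashMap.empty[String, Int]
    private val spills = mutable.ArrayBuffer.empty[Vector[(String, Int)]]

    def insert(key: String, value: Int): Unit = {
      // Map-side combine: merge the value into the running sum for this key.
      buffer.update(key, buffer.getOrElse(key, 0) + value)
      // Threshold check after every insert.
      if (buffer.size >= maxInMemoryKeys) spill()
    }

    private def spill(): Unit = {
      // Sort by key, "write" the spill, then clear the in-memory structure.
      spills += buffer.toVector.sortBy(_._1)
      buffer.clear()
    }

    def spillCount: Int = spills.size

    // Merge every spill plus what is still in memory into one sorted output.
    def mergedOutput(): Vector[(String, Int)] = {
      val all = spills.flatten ++ buffer.toVector
      all.groupBy(_._1)
        .map { case (k, vs) => k -> vs.map(_._2).sum }
        .toVector
        .sortBy(_._1)
    }
  }
}
```

Because a key can appear in more than one spill, the final merge has to combine values again; this is the reason the merge step cannot simply concatenate the temporary files when aggregation is involved.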

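Finally, each Task merges all of its temporary files. This is the merge step: all temporary files are read back and written once into the final file, so all the data of one Task ends up in a single file. A separate index file is written alongside it; for each downstream Task it records where that Task's data sits in the data file, as a start offset and an end offset.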
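How the start and end offsets follow from the per-partition lengths can be shown with a cumulative sum. This is a sketch of the idea only; the function names are made up here, and the real index file is a binary file rather than a Scala vector.

```scala
object ShuffleIndexSketch {
  // The result has one more entry than partitionLengths:
  // offsets(i) is where partition i starts, offsets(i + 1) is where it ends.
  def offsets(partitionLengths: Seq[Long]): Vector[Long] =
    partitionLengths.scanLeft(0L)(_ + _).toVector

  // Byte range [start, end) that reduce task `reduceId` reads from the data file.
  def rangeFor(offs: Vector[Long], reduceId: Int): (Long, Long) =
    (offs(reduceId), offs(reduceId + 1))
}
```

For partition lengths 10, 0 and 25 the offsets are 0, 10, 10 and 35, so the reduce task for partition 2 reads bytes 10 up to 35, and the empty partition 1 has identical start and end offsets.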

Regular SortShuffle source code analysis

The SortShuffleWriter.write method

/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
    // The sorter
    sorter = if (dep.mapSideCombine) {
        require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
        new ExternalSorter[K, V, C](
            context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
        new ExternalSorter[K, V, V](
            context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
    // Insert the output records of the map task into the buffer
    sorter.insertAll(records)
    // The shuffle data file
    val output: File = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val tmp = Utils.tempFileWith(output)
    try {
        // Write the data buffered on the map side to disk and create the index file
        // that belongs to the Block file.
        val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
        // Record the length of the data of each partition
        val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
        // Create the index file for the Block file. It records the offset of each partition
        // inside the Block file so that the reduce tasks can use it when fetching.
        shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
        mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
        if (tmp.exists() && !tmp.delete()) {
            logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
        }
    }
}

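Bypass SortShuffle

The bypass mechanism is triggered when both of the following conditions hold:

The number of reduce-side partitions (shuffle read tasks) is no greater than the value of spark.shuffle.sort.bypassMergeThreshold, which defaults to 200.
The operator does not aggregate on the map side (reduceByKey, for example, does aggregate on the map side and therefore does not qualify).

In this case the task creates one temporary disk file for every reduce-side task, hashes each record by key, and writes the record to the disk file that corresponds to the hash of its key. As before, records first go into an in-memory buffer, and the buffer is flushed to the disk file when it is full. At the end, all temporary disk files are again merged into one disk file, and a separate index file is created.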

[Figure: bypass SortShuffle, one temporary file per reduce task, then concatenation]
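The bypass path (one output per reduce partition, no sorting, concatenation at the end) can be imitated with byte buffers in memory. This is a sketch, not the BypassMergeSortShuffleWriter implementation: the object name, the key=value text encoding, and the use of ByteArrayOutputStream instead of real temporary files are all assumptions made for illustration.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

object BypassWriterSketch {
  // Returns the concatenated "data file" and the length of each partition in it.
  def write(records: Seq[(String, String)], numPartitions: Int): (Array[Byte], Vector[Long]) = {
    // One in-memory buffer per reduce partition stands in for one temporary file each.
    val perPartition = Vector.fill(numPartitions)(new ByteArrayOutputStream())
    records.foreach { case (k, v) =>
      // Route the whole record by the hash of its key.
      val p = Math.floorMod(k.hashCode, numPartitions)
      perPartition(p).write(s"$k=$v\n".getBytes(StandardCharsets.UTF_8))
    }
    // Concatenate the buffers in partition order; no sorting is involved.
    val data = new ByteArrayOutputStream()
    perPartition.foreach(b => data.write(b.toByteArray))
    (data.toByteArray, perPartition.map(_.size().toLong))
  }
}
```

The returned lengths are exactly what is needed to build the index file, and within a partition the records keep their arrival order because nothing is sorted.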

Bypass SortShuffle source code analysis

Sometimes the map side does not need to sort or otherwise process the data before persisting it. In that case BypassMergeSortShuffleWriter, one of the ShuffleWriter implementations, is the right tool.

private[spark] object SortShuffleWriter {
    // Decide whether the bypass SortShuffle path is used
    def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
        // We cannot bypass sorting if we need to do map-side aggregation.
        if (dep.mapSideCombine) {
            require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
            false
        } else {
            val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
            dep.partitioner.numPartitions <= bypassMergeThreshold
        }
    }
}