The sequence diagram below shows how Spark caches partition data after RDD.persist is called.

[Figure: sequence diagram of Spark caching partition data after RDD.persist]

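In this article, the StorageLevel argument passed to RDD.persist(StorageLevel) is MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2).

With this level, partition data is cached in memory when there is enough room and is written to disk when there is not. It is never cached off-heap, it is stored in serialized form, and each block is also replicated to one other node (a replication factor of 2). Blocks held on a remote node are read through the default Netty-based service (NettyBlockRpcServer).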
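
To make the five constructor arguments easier to read, here is a minimal sketch that models them with a plain case class. `StorageLevelSketch`, `Level` and `describe` are hypothetical names used only for illustration; they are not Spark classes. The field order follows the Spark 1.x constructor call quoted above (useDisk, useMemory, useOffHeap, deserialized, replication).

```scala
// Simplified stand-in for org.apache.spark.storage.StorageLevel (not the real class).
object StorageLevelSketch {
  final case class Level(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int)

  // Same argument values as MEMORY_AND_DISK_SER_2 in the text above.
  val MemoryAndDiskSer2: Level = Level(true, true, false, false, 2)

  // Render the flags as a short human-readable summary.
  def describe(l: Level): String = {
    val places = Seq(
      if (l.useMemory) Some("memory") else None,
      if (l.useDisk) Some("disk") else None,
      if (l.useOffHeap) Some("off-heap") else None).flatten.mkString("+")
    val form = if (l.deserialized) "deserialized" else "serialized"
    s"$places, $form, ${l.replication} replica(s)"
  }
}
```

Calling `StorageLevelSketch.describe(StorageLevelSketch.MemoryAndDiskSer2)` yields "memory+disk, serialized, 2 replica(s)", which matches the description in the paragraph above.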


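A ShuffleMapTask or ResultTask computes the data of one partition by calling RDD.iterator. If RDD.persist was called in the driver program, RDD.iterator goes through CacheManager.getOrCompute to compute the partition. If the partition is already cached, its data is read from the BlockManager by block ID (the block ID is built from the RDD id and the partition index).

Block data is read from the BlockManager by calling BlockManager.get. BlockManager.get first calls BlockManager.getLocal to try to read the block from the local node; if the local node does not hold the block, it calls BlockManager.getRemote to read it from another node. The source of BlockManager.get: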
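
The get-or-compute idea can be sketched in a few lines. This is a simplified model, not CacheManager itself: `CacheLookupSketch`, its `blocks` map and `computeCount` are hypothetical, and the map stands in for the BlockManager. The block name format "rdd_<rddId>_<partitionIndex>" mirrors how Spark names RDD blocks.

```scala
import scala.collection.mutable

object CacheLookupSketch {
  // Build a block name from the RDD id and the partition index.
  def rddBlockId(rddId: Int, partitionIndex: Int): String =
    s"rdd_${rddId}_$partitionIndex"

  // Hypothetical in-memory stand-in for the BlockManager.
  private val blocks = mutable.Map.empty[String, Vector[Int]]

  // Counts how often the partition really had to be computed.
  var computeCount = 0

  // Return the cached partition if present; otherwise compute it once and cache it.
  def getOrCompute(rddId: Int, partitionIndex: Int)(compute: => Vector[Int]): Vector[Int] = {
    val id = rddBlockId(rddId, partitionIndex)
    blocks.get(id) match {
      case Some(cached) => cached
      case None =>
        computeCount += 1
        val data = compute
        blocks.put(id, data)
        data
    }
  }
}
```

The `compute` argument is passed by name, so it is only evaluated when the block is missing. A second call for the same RDD id and partition index returns the cached data without recomputing.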


def get(blockId: BlockId): Option[BlockResult] = {
    // Try the local node first
    val local = getLocal(blockId)
    if (local.isDefined) {
      logInfo(s"Found block $blockId locally")
      return local
    }
    // Not found locally, so fetch the block from another node
    val remote = getRemote(blockId)
    if (remote.isDefined) {
      logInfo(s"Found block $blockId remotely")
      return remote
    }
    None
  }


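BlockManager.getLocal ends up calling BlockManager.doGetLocal to read the block from the local node. doGetLocal tries memory first and falls back to disk when the block is not in memory. If the block's storage level includes memory, the block read from disk is then cached back into memory. The source of BlockManager.doGetLocal: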


private def doGetLocal(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
    val info = blockInfo.get(blockId).orNull
    if (info != null) {
      info.synchronized {
        // Double check to make sure the block is still there. There is a small chance that the
        // block has been removed by removeBlock (which also synchronizes on the blockInfo object).
        // Note that this only checks metadata tracking. If user intentionally deleted the block
        // on disk or from off heap storage without using removeBlock, this conditional check will
        // still pass but eventually we will get an exception because we can't find the block.
        if (blockInfo.get(blockId).isEmpty) {
          logWarning(s"Block $blockId had been removed")
          return None
        }

        // If another thread is writing the block, wait for it to become ready.
        if (!info.waitForReady()) {
          // If we get here, the block write failed.
          logWarning(s"Block $blockId was marked as failure.")
          return None
        }

        val level = info.level
        logDebug(s"Level for block $blockId is $level")

        // Look for the block in memory
        if (level.useMemory) {
          logDebug(s"Getting block $blockId from memory")
          val result = if (asBlockResult) {
            memoryStore.getValues(blockId).map(new BlockResult(_, DataReadMethod.Memory, info.size))
          } else {
            // The returned bytes were serialized with blockManager.dataSerialize
            memoryStore.getBytes(blockId)
          }
          result match {
            case Some(values) =>
              // Found in memory: return right away; otherwise fall through to off-heap storage and disk
              return result
            case None =>
              logDebug(s"Block $blockId not found in memory")
          }
        }

        // Look for the block in external block store
        if (level.useOffHeap) {
          logDebug(s"Getting block $blockId from ExternalBlockStore")
          if (externalBlockStore.contains(blockId)) {
            val result = if (asBlockResult) {
              externalBlockStore.getValues(blockId)
                .map(new BlockResult(_, DataReadMethod.Memory, info.size))
            } else {
              externalBlockStore.getBytes(blockId)
            }
            result match {
              case Some(values) =>
                return result
              case None =>
                logDebug(s"Block $blockId not found in ExternalBlockStore")
            }
          }
        }

        // Look for block on disk, potentially storing it back in memory if required
        if (level.useDisk) {
          logDebug(s"Getting block $blockId from disk")
          val bytes: ByteBuffer = diskStore.getBytes(blockId) match {
            case Some(b) => b
            case None =>
              throw new BlockException(
                blockId, s"Block $blockId not found on disk, though it should be")
          }
          assert(0 == bytes.position())

          if (!level.useMemory) {
            // If the block shouldn't be stored in memory, we can just return it
            if (asBlockResult) {
              // The storage level does not include memory, so return the block directly
              return Some(new BlockResult(dataDeserialize(blockId, bytes), DataReadMethod.Disk,
                info.size))
            } else {
              return Some(bytes)
            }
          } else {
            // Otherwise, we also have to store something in the memory store
            // The storage level includes memory, so the block read from disk is cached back into memory
            if (!level.deserialized || !asBlockResult) {
              /* We'll store the bytes in memory if the block's storage level includes
               * "memory serialized", or if it should be cached as objects in memory
               * but we only requested its serialized bytes. */
              memoryStore.putBytes(blockId, bytes.limit, () => {
                // https://issues.apache.org/jira/browse/SPARK-6076
                // If the file size is bigger than the free memory, OOM will happen. So if we cannot
                // put it into MemoryStore, copyForMemory should not be created. That's why this
                // action is put into a `() => ByteBuffer` and created lazily.
                val copyForMemory = ByteBuffer.allocate(bytes.limit)
                copyForMemory.put(bytes)
              })
              bytes.rewind()
            }
            if (!asBlockResult) {
              return Some(bytes)
            } else {
              val values = dataDeserialize(blockId, bytes)
              if (level.deserialized) {
                // Cache the values before returning them
                val putResult = memoryStore.putIterator(
                  blockId, values, level, returnValues = true, allowPersistToDisk = false)
                // The put may or may not have succeeded, depending on whether there was enough
                // space to unroll the block. Either way, the put here should return an iterator.
                putResult.data match {
                  case Left(it) =>
                    return Some(new BlockResult(it, DataReadMethod.Disk, info.size))
                  case _ =>
                    // This only happens if we dropped the values back to disk (which is never)
                    throw new SparkException("Memory store did not return an iterator!")
                }
              } else {
                return Some(new BlockResult(values, DataReadMethod.Disk, info.size))
              }
            }
          }
        }
      }
    } else {
      logDebug(s"Block $blockId not registered locally")
    }
    None
  }
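
The lookup order in doGetLocal can be condensed into a small model. This is only a sketch under simplifying assumptions: `LocalReadSketch` and `Store` are hypothetical, two mutable maps stand in for MemoryStore and DiskStore, the off-heap store is left out, and a miss on disk returns None instead of throwing a BlockException as the real code does.

```scala
import scala.collection.mutable

object LocalReadSketch {
  // useMemory / useDisk play the role of the flags on the block's storage level.
  final class Store(useMemory: Boolean, useDisk: Boolean) {
    val memory = mutable.Map.empty[String, Array[Byte]]
    val disk = mutable.Map.empty[String, Array[Byte]]

    def getLocal(blockId: String): Option[Array[Byte]] = {
      // 1. Look in memory first, if the level allows memory.
      val fromMemory = if (useMemory) memory.get(blockId) else None
      fromMemory.orElse {
        // 2. Fall back to disk, if the level allows disk.
        val fromDisk = if (useDisk) disk.get(blockId) else None
        // 3. Cache a disk hit back into memory only when the level includes memory.
        if (useMemory) fromDisk.foreach(bytes => memory.put(blockId, bytes))
        fromDisk
      }
    }
  }
}
```

With `new LocalReadSketch.Store(useMemory = true, useDisk = true)`, the first read of a block that exists only on disk copies it into `memory`, so later reads are served from memory. With `useMemory = false` the block is returned straight from disk and `memory` stays empty.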



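If the block cannot be read from the local node, it is fetched from another node. BlockManager.getRemote does this through BlockManager.doGetRemote.

BlockManager.doGetRemote first calls BlockManagerMaster.getLocations, which uses AkkaRpcEndpointRef.askWithRetry to send a GetLocations message to the driver. On receiving GetLocations, the driver returns all the nodes that hold the block. To balance load across the nodes storing the block, one node is picked at random from the returned sequence as the source, and NettyBlockTransferService.fetchBlockSync is then called to read the block from that node.

From here on, reading the block follows almost the same flow as reading blocks in the shuffle reduce phase, so the sequence diagram is not drawn again.

Sequence diagram of NettyBlockTransferService.fetchBlock reading a block served by the remote NettyBlockRpcServer: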
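
The random source selection can be sketched as follows. This is an illustration, not the Spark code: `RemoteReadSketch`, `getRemote` and the `fetch` function are hypothetical, with `fetch` standing in for NettyBlockTransferService.fetchBlockSync. The sketch also assumes that the remaining locations are tried in shuffled order when a fetch returns nothing, which the text above does not state.

```scala
import scala.util.Random

object RemoteReadSketch {
  // locations: the nodes that the driver reported as holding the block.
  // fetch: hypothetical stand-in for a synchronous remote block fetch.
  def getRemote[A](locations: Seq[String], rng: Random)(fetch: String => Option[A]): Option[A] = {
    // Shuffle so that no single replica is always asked first.
    val shuffled = rng.shuffle(locations)
    // The iterator is lazy: fetching stops at the first location that returns data.
    shuffled.iterator.map(fetch).collectFirst { case Some(data) => data }
  }
}
```

Passing the `Random` instance in explicitly keeps the sketch testable; a caller that does not care about reproducibility can pass `new Random()`.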

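Here, once the remote node receives the OpenBlocks message, BlockManager.getBlockData handles the read differently from the shuffle case: for a non-shuffle block it calls BlockManager.doGetLocal to serve the request. The source of BlockManager.getBlockData: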


override def getBlockData(blockId: BlockId): ManagedBuffer = {
    if (blockId.isShuffle) {
      /*
      * For a shuffle block, first use the index file to locate the block inside the shuffle data file
      * */
      shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
    } else {
      // Handling of non-shuffle blocks
      val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
        .asInstanceOf[Option[ByteBuffer]]
      if (blockBytesOpt.isDefined) {
        val buffer = blockBytesOpt.get
        new NioManagedBuffer(buffer)
      } else {
        throw new BlockNotFoundException(blockId.toString)
      }
    }
  }



This completes the analysis of how blocks stored in the BlockManager are read.