1. 背景
在https://blog.51cto.com/u_15327484/8023493、https://blog.51cto.com/u_15327484/8089923和https://blog.51cto.com/u_15327484/8095971三篇文章中,介绍了HDFS写文件在client、NameNode、DataNode组件侧的行为逻辑。
对于HDFS读文件流程来说相对简单:
- 获取HDFS文件起始的的block信息。
- 选择离客户端最近的datanode读取block。
- 当block读取完毕,ClientProtocol.getBlockLocations()读取下一个block位置信息。
- 所有block读取完全,调用HdfsDataInputStream.close()方法关闭输入流。
HDFS读文件流程如下:
2. HDFS读文件业务代码示例
如下所示,先调用FileSystem.get()创建DistributedFileSystem对象,再调用DistributedFileSystem.open打开文件,最后调用FSDataInputStream.readLine读取内容:
public void testRead() {
try {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path p = new Path("hdfs://localhost:9000/a.txt");
FSDataInputStream in = fs.open(p);
BufferedReader buff = new BufferedReader(new InputStreamReader(in));
String str = null;
while ((str = buff.readLine()) != null) {
System.out.println(str);
}
buff.close();
in.close();
} catch (IllegalArgumentException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
3. 客户端执行open操作
DistributedFileSystem调用DFSClient.open创建输入流FSDataInputStream:
DFSClient dfs;
public FSDataInputStream open(Path f, final int bufferSize)
throws IOException {
statistics.incrementReadOps(1);
storageStatistics.incrementOpCounter(OpType.OPEN);
Path absF = fixRelativePart(f);
return new FileSystemLinkResolver<FSDataInputStream>() {
@Override
public FSDataInputStream doCall(final Path p) throws IOException {
final DFSInputStream dfsis =
dfs.open(getPathName(p), bufferSize, verifyChecksum);
return dfs.createWrappedInputStream(dfsis);
}
@Override
public FSDataInputStream next(final FileSystem fs, final Path p)
throws IOException {
return fs.open(p, bufferSize);
}
}.resolve(this, absF);
}
DFSClient.open先默认获取10个blocks位置信息,再根据这些blocks创建DFSInputStream:
public DFSInputStream open(String src, int buffersize, boolean verifyChecksum)
throws IOException {
checkOpen();
// Get block info from namenode
try (TraceScope ignored = newPathTraceScope("newDFSInputStream", src)) {
//获取前10个blocks信息
LocatedBlocks locatedBlocks = getLocatedBlocks(src, 0);
//创建DFSInputStream
return openInternal(locatedBlocks, src, verifyChecksum);
}
}
//创建DFSInputStream
private DFSInputStream openInternal(LocatedBlocks locatedBlocks, String src,
boolean verifyChecksum) throws IOException {
//省略
return new DFSInputStream(this, src, verifyChecksum, locatedBlocks);
}
DFSClient.getLocatedBlocks()方法用于获取HDFS文件从[start,dfsClientConf.getPrefetchSize()]区间对应的blocks,start为0,dfsClientConf.getPrefetchSize()为10*BlockSize。即每次客户端获取10个blocks信息:
public LocatedBlocks getLocatedBlocks(String src, long start)
throws IOException {
return getLocatedBlocks(src, start, dfsClientConf.getPrefetchSize());
}
最终调用ClientProtocol.getBlockLocations获取前10个blocks信息:
static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
String src, long start, long length)
throws IOException {
try {
return namenode.getBlockLocations(src, start, length);
} catch(RemoteException re) {
throw re.unwrapRemoteException(AccessControlException.class,
FileNotFoundException.class,
UnresolvedPathException.class);
}
}
DFSInputStream包含blocks信息,路径信息:
DFSInputStream(DFSClient dfsClient, String src, boolean verifyChecksum,
LocatedBlocks locatedBlocks) throws IOException {
this.dfsClient = dfsClient;
this.verifyChecksum = verifyChecksum;
this.src = src;
synchronized (infoLock) {
this.cachingStrategy = dfsClient.getDefaultReadCachingStrategy();
}
this.locatedBlocks = locatedBlocks;
openInfo(false);
}
void openInfo(boolean refreshLocatedBlocks) throws IOException {
final DfsClientConf conf = dfsClient.getConf();
synchronized(infoLock) {
lastBlockBeingWrittenLength =
fetchLocatedBlocksAndGetLastBlockLength(refreshLocatedBlocks);
int retriesForLastBlockLength = conf.getRetryTimesForGetLastBlockLength();
while (retriesForLastBlockLength > 0) {
// Getting last block length as -1 is a special case. When cluster
// restarts, DNs may not report immediately. At this time partial block
// locations will not be available with NN for getting the length. Lets
// retry for 3 times to get the length.
if (lastBlockBeingWrittenLength == -1) {
DFSClient.LOG.warn("Last block locations not available. "
+ "Datanodes might not have reported blocks completely."
+ " Will retry for " + retriesForLastBlockLength + " times");
waitFor(conf.getRetryIntervalForGetLastBlockLength());
//获取最后一个block长度
lastBlockBeingWrittenLength =
fetchLocatedBlocksAndGetLastBlockLength(true);
} else {
break;
}
retriesForLastBlockLength--;
}
if (lastBlockBeingWrittenLength == -1
&& retriesForLastBlockLength == 0) {
throw new IOException("Could not obtain the last block locations.");
}
}
}
4. 客户端执行read操作
执行DFSInputStream.read(),会途径DFSInputStream.readWithStrategy()方法,blockSeekTo方法负责获取第pos的block对应的最近的datanode,readBuffer负责从datanode中读取block信息:
protected synchronized int readWithStrategy(ReaderStrategy strategy)
throws IOException {
dfsClient.checkOpen();
if (closed.get()) {
throw new IOException("Stream closed");
}
int len = strategy.getTargetLength();
CorruptedBlocks corruptedBlocks = new CorruptedBlocks();
failures = 0;
if (pos < getFileLength()) {
int retries = 2;
while (retries > 0) {
try {
// currentNode can be left as null if previous read had a checksum
// error on the same block. See HDFS-3067
if (pos > blockEnd || currentNode == null) {
//获取第pos的block对应的最近的datanode
currentNode = blockSeekTo(pos);
}
//省略
//读取datanode中的block信息
int result = readBuffer(strategy, realLen, corruptedBlocks);
if (result >= 0) {
pos += result;
} else {
// got a EOS from reader though we expect more data on it.
throw new IOException("Unexpected EOS from the reader");
}
//省略
return -1;
}
blockSeekTo获取block对应的优先级最高的dn,并构建blockReader:
private synchronized DatanodeInfo blockSeekTo(long target)
throws IOException {
//获取第target个block
LocatedBlock targetBlock = getBlockAt(target);
//选择block优先级最高的dn
DNAddrPair retval = chooseDataNode(targetBlock, null);
chosenNode = retval.info;
InetSocketAddress targetAddr = retval.addr;
StorageType storageType = retval.storageType;
// Latest block if refreshed by chooseDatanode()
targetBlock = retval.block;
//构建block对于的BlockReader对象
blockReader = getBlockReader(targetBlock, offsetIntoBlock,
targetBlock.getBlockSize() - offsetIntoBlock, targetAddr,
storageType, chosenNode);
}
getBlockReader用于构建blockReader,如果client在dn上,就构建BlockReaderLocal读取本地block,否则就通过BlockReaderRemote读取远程datanode中的block:
if (scConf.isShortCircuitLocalReads() && allowShortCircuitLocalReads) {
if (clientContext.getUseLegacyBlockReaderLocal()) {
reader = getLegacyBlockReaderLocal();
if (reader != null) {
LOG.trace("{}: returning new legacy block reader local.", this);
return reader;
}
} else {
reader = getBlockReaderLocal();
if (reader != null) {
LOG.trace("{}: returning new block reader local.", this);
return reader;
}
}
}
if (scConf.isDomainSocketDataTraffic()) {
reader = getRemoteBlockReaderFromDomain();
if (reader != null) {
LOG.trace("{}: returning new remote block reader using UNIX domain "
+ "socket on {}", this, pathInfo.getPath());
return reader;
}
}
它会调用Sender.readBlock方法,发送READ_BLOCK请求:
public static BlockReader newBlockReader(String file,
ExtendedBlock block,
Token<BlockTokenIdentifier> blockToken,
long startOffset, long len,
boolean verifyChecksum,
String clientName,
Peer peer, DatanodeID datanodeID,
PeerCache peerCache,
CachingStrategy cachingStrategy,
int networkDistance) throws IOException {
// in and out will be closed when sock is closed (by the caller)
final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
peer.getOutputStream()));
new Sender(out).readBlock(block, blockToken, clientName, startOffset, len,
verifyChecksum, cachingStrategy);
public void readBlock(final ExtendedBlock blk,
final Token<BlockTokenIdentifier> blockToken,
final String clientName,
final long blockOffset,
final long length,
final boolean sendChecksum,
final CachingStrategy cachingStrategy) throws IOException {
OpReadBlockProto proto = OpReadBlockProto.newBuilder()
.setHeader(DataTransferProtoUtil.buildClientHeader(blk, clientName,
blockToken))
.setOffset(blockOffset)
.setLen(length)
.setSendChecksums(sendChecksum)
.setCachingStrategy(getCachingStrategy(cachingStrategy))
.build();
//发送READ_BLOCK请求
send(out, Op.READ_BLOCK, proto);
}
本文以BlockReaderRemote远程读取为例进行研究。
调用DFSInputStream.read()方法,最终会调用BlockReaderRemote.read()方法,它进入readNextPacket方法读取packet:
public synchronized int read(ByteBuffer buf) throws IOException {
if (curDataSlice == null ||
(curDataSlice.remaining() == 0 && bytesNeededToFinish > 0)) {
//读取packet
readNextPacket();
}
//省略
return nRead;
}
readNextPacket读取packet,并通过checksum进行校验:
private void readNextPacket() throws IOException {
//Read packet headers.
//读取数据包头和数据包
packetReceiver.receiveNextPacket(in);
PacketHeader curHeader = packetReceiver.getHeader();
curDataSlice = packetReceiver.getDataSlice();
assert curDataSlice.capacity() == curHeader.getDataLen();
LOG.trace("DFSClient readNextPacket got header {}", curHeader);
// Sanity check the lengths
if (!curHeader.sanityCheck(lastSeqNo)) {
throw new IOException("BlockReader: error in packet header " +
curHeader);
}
if (curHeader.getDataLen() > 0) {
int chunks = 1 + (curHeader.getDataLen() - 1) / bytesPerChecksum;
int checksumsLen = chunks * checksumSize;
assert packetReceiver.getChecksumSlice().capacity() == checksumsLen :
"checksum slice capacity=" +
packetReceiver.getChecksumSlice().capacity() +
" checksumsLen=" + checksumsLen;
lastSeqNo = curHeader.getSeqno();
if (verifyChecksum && curDataSlice.remaining() > 0) {
// N.B.: the checksum error offset reported here is actually
// relative to the start of the block, not the start of the file.
// This is slightly misleading, but preserves the behavior from
// the older BlockReader.
//通过chunksum检查读取的数据是否正确
checksum.verifyChunkedSums(curDataSlice,
packetReceiver.getChecksumSlice(),
filename, curHeader.getOffsetInBlock());
}
bytesNeededToFinish -= curHeader.getDataLen();
}
}
5. DataNode处理READ_BLOCK请求
根据https://blog.51cto.com/u_15327484/8095971中对于DataNode线程模型的分析,处理读请求时,DataNode线程模型如下:
在DataNode中,由DataXceiver.readBlock处理客户端的READ_BLOCK请求。readBlock方法主要构建BlockSender,调用sendBlock方法发送数据:
public void readBlock(final ExtendedBlock block,
final Token<BlockTokenIdentifier> blockToken,
final String clientName,
final long blockOffset,
final long length,
final boolean sendChecksum,
final CachingStrategy cachingStrategy) throws IOException {
previousOpClientName = clientName;
long read = 0;
updateCurrentThreadName("Sending block " + block);
OutputStream baseStream = getOutputStream();
DataOutputStream out = getBufferedOutputStream();
checkAccess(out, true, block, blockToken, Op.READ_BLOCK,
BlockTokenIdentifier.AccessMode.READ);
// send the block
BlockSender blockSender = null;
//省略
blockSender = new BlockSender(block, blockOffset, length,
true, false, sendChecksum, datanode, clientTraceFmt,
cachingStrategy);
//省略
read = blockSender.sendBlock(out, baseStream, null); // send data
//省略
}
调用doSendBlock方法发送数据:
private long doSendBlock(DataOutputStream out, OutputStream baseStream,
DataTransferThrottler throttler) throws IOException {
//省略
ByteBuffer pktBuf = ByteBuffer.allocate(pktBufSize);
//省略
while (endOffset > offset && !Thread.currentThread().isInterrupted()) {
manageOsCache();
long len = sendPacket(pktBuf, maxChunksPerPacket, streamForSendChunks,
transferTo, throttler);
offset += len;
totalRead += len + (numberOfChunks(len) * checksumSize);
seqno++;
}
//省略
return totalRead;
}
在sendPacket方法中,有一处可以关注,它读取dfs.datanode.transferTo.allowed配置,如果为true,则进行零拷贝:
if (transferTo) {
SocketOutputStream sockOut = (SocketOutputStream)out;
// First write header and checksums
sockOut.write(buf, headerOff, dataOff - headerOff);
// no need to flush since we know out is not a buffered stream
FileChannel fileCh = ((FileInputStream)ris.getDataIn()).getChannel();
LongWritable waitTime = new LongWritable();
LongWritable transferTime = new LongWritable();
fileIoProvider.transferToSocketFully(
ris.getVolumeRef().getVolume(), sockOut, fileCh, blockInPosition,
dataLen, waitTime, transferTime);
datanode.metrics.addSendDataPacketBlockedOnNetworkNanos(waitTime.get());
datanode.metrics.addSendDataPacketTransferNanos(transferTime.get());
blockInPosition += dataLen;
} else {
// normal transfer
out.write(buf, headerOff, dataOff + dataLen - headerOff);
}
零拷贝表示不需要应用程序流转,直接将内核中的数据读取到socket buffer中:
如果不开启dfs.datanode.transferTo.allowed配置,会通过应用程序的buf进行中转,效率不高:
if (!transferTo) { // normal transfer
try {
ris.readDataFully(buf, dataOff, dataLen);
} catch (IOException ioe) {
if (ioe.getMessage().startsWith(EIO_ERROR)) {
throw new DiskFileCorruptException("A disk IO error occurred", ioe);
}
throw ioe;
}
if (verifyChecksum) {
verifyChecksum(buf, dataOff, dataLen, numChunks, checksumOff);
}
}
这就是传统的数据传输方式: