The previous post analyzed Hadoop's file-write path. With writing understood, reading is much easier to follow. As before, this post focuses on the client-side implementation, and I again recommend reading the write post first. The rough flow of reading a file is as follows:
Whether reading or writing, the NameNode acts as a middleman. The client submits its request to the NameNode, which picks suitable DataNodes and introduces them to the client; from then on the client talks to the DataNodes directly, reading and writing as it pleases. The strategy is reminiscent of DMA: the controller only sets up the transfer, which takes load off the NameNode and improves efficiency.
Consequently, most of the traffic in file reads and writes flows between the client and the DataNodes. The RPC protocol between them is ClientDatanodeProtocol, yet you will find no read or write interfaces in it: in Hadoop, data transfer bypasses the RPC machinery entirely and runs over its own streaming framework. On the DataNode side, the DataNode class holds a DataXceiverServer instance that waits for requests in a dedicated thread and spawns a DataXceiver thread per incoming request; one thread per request keeps the DataNode-side logic simple. DataXceiver currently supports six request types (the exact request and reply packet formats are not reproduced here). Hadoop does not wrap these requests in classes; the fields are simply written to the stream in order, which makes the code harder to read and maintain, for reasons that are unclear.
Compared with writing, reading a file is a simple affair. The client's DFSClient contains a DFSClient.DFSInputStream class, and opening a file for reading creates a DFSInputStream instance. It first calls the getBlockLocations interface defined in ClientProtocol, handing the NameNode the file path, the read offset, and the read length, and receives a LocatedBlocks object: a list of LocatedBlock entries describing every block covered by the requested range, along with the locations of all DataNodes holding each block. Once reading starts, DFSInputStream picks one DataNode out of the group serving the current block and connects to it. The selection algorithm in the current implementation is trivial: take the first DataNode that is not known to be dead, with no regard for the client's network position relative to the DataNodes. The read request reaches the DataNode and is handled by a DataXceiver; data flows back packet by packet. Once the whole block has been read, the connection is closed and the client connects to a DataNode holding the next block, repeating until everything requested has been read.
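Before diving into the internals, here is a minimal sketch of what this machinery looks like from the application side, using the standard FileSystem API (the path is made up for illustration); everything below fs.open() is the DFSInputStream logic discussed in this post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // fs.defaultFS -> cluster
    // open() returns an FSDataInputStream that wraps a DFSInputStream
    FSDataInputStream in = fs.open(new Path("/tmp/demo.txt")); // hypothetical path
    try {
      byte[] buf = new byte[4096];
      int n;
      // each read() may return fewer bytes than asked for, e.g. at a
      // block boundary; loop until EOF (-1)
      while ((n = in.read(buf, 0, buf.length)) != -1) {
        System.out.write(buf, 0, n);
      }
    } finally {
      in.close();
    }
  }
}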
As with writing, the main read logic lives in the DFSInputStream class. Start with the constructor:
DFSInputStream(DFSClient dfsClient, String src, int buffersize,
    boolean verifyChecksum) throws IOException, UnresolvedLinkException {
  this.dfsClient = dfsClient;
  this.verifyChecksum = verifyChecksum;
  this.buffersize = buffersize;
  this.src = src;
  this.socketCache = dfsClient.socketCache;
  prefetchSize = dfsClient.getConf().prefetchSize;
  timeWindow = dfsClient.getConf().timeWindow;
  nCachedConnRetry = dfsClient.getConf().nCachedConnRetry;
  openInfo();
}
The preparation step openInfo fetches from the NameNode all the blocks of the file being opened. Its core, which also computes the length of a last block still under construction, looks like this:
LocatedBlocks newInfo = dfsClient.getLocatedBlocks(src, 0, prefetchSize);
if (DFSClient.LOG.isDebugEnabled()) {
  DFSClient.LOG.debug("newInfo = " + newInfo);
}
if (newInfo == null) {
  throw new IOException("Cannot open filename " + src);
}
if (locatedBlocks != null) {
  Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
  Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
  while (oldIter.hasNext() && newIter.hasNext()) {
    if (!oldIter.next().getBlock().equals(newIter.next().getBlock())) {
      throw new IOException("Blocklist for " + src + " has changed!");
    }
  }
}
locatedBlocks = newInfo;
long lastBlockBeingWrittenLength = 0;
if (!locatedBlocks.isLastBlockComplete()) {
  final LocatedBlock last = locatedBlocks.getLastLocatedBlock();
  if (last != null) {
    if (last.getLocations().length == 0) {
      return -1;
    }
    final long len = readBlockLength(last);
    last.getBlock().setNumBytes(len);
    lastBlockBeingWrittenLength = len;
  }
}
currentNode = null;
return lastBlockBeingWrittenLength;
1. dfsClient.getLocatedBlocks in effect calls namenode.getBlockLocations, which returns all the blocks of the file.
2. If a block list is already cached, verify the new list agrees with it; then cache the new list.
3. Check isLastBlockComplete. Writing a file in Hadoop means writing blocks to DataNodes, and the NameNode learns which blocks make up a file from the DataNodes' periodic reports. When reading, the last block may therefore not have been reported yet, so the client can only rely on reported blocks and must ask a DataNode directly for the current length of an incomplete last block (readBlockLength above). A quick way to inspect the block layout a client receives is sketched after this list.
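From user code you can see the same per-block information through the public FileSystem API. A minimal sketch (the path is hypothetical) that prints each block's offset, length, and hosts, i.e. roughly what getLocatedBlocks hands to DFSInputStream:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.Arrays;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/tmp/demo.txt")); // hypothetical path
    // one BlockLocation per block covered by the requested range
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset=" + loc.getOffset()
          + " len=" + loc.getLength()
          + " hosts=" + Arrays.toString(loc.getHosts()));
    }
  }
}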
Now look at the read method:
public synchronized int read(final byte buf[], int off, int len)
    throws IOException {
  ReaderStrategy byteArrayReader = new ByteArrayStrategy(buf);
  return readWithStrategy(byteArrayReader, off, len);
}
It first creates a ByteArrayStrategy reader, which copies block data into the caller's byte array; Hadoop also provides a ByteBufferStrategy supporting NIO-style reads into a ByteBuffer. A simplified sketch of the two strategies follows (illustrative signatures, not the exact Hadoop ones).
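Both strategies wrap a destination buffer and delegate the actual transfer to the BlockReader:

import java.io.IOException;
import java.nio.ByteBuffer;

// Simplified sketch of the reader-strategy idea (illustrative signatures;
// the real Hadoop interface also threads extra state through).
interface ReaderStrategy {
  int doRead(BlockReader blockReader, int off, int len) throws IOException;
}

class ByteArrayStrategy implements ReaderStrategy {
  private final byte[] buf;
  ByteArrayStrategy(byte[] buf) { this.buf = buf; }
  public int doRead(BlockReader blockReader, int off, int len)
      throws IOException {
    return blockReader.read(buf, off, len); // classic copy into a byte[]
  }
}

class ByteBufferStrategy implements ReaderStrategy {
  private final ByteBuffer buf;
  ByteBufferStrategy(ByteBuffer buf) { this.buf = buf; }
  public int doRead(BlockReader blockReader, int off, int len)
      throws IOException {
    return blockReader.read(buf);           // NIO-style read into a ByteBuffer
  }
}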
Then readWithStrategy does the actual work; its core is:
try {
  // currentNode can be left as null if previous read had a checksum
  // error on the same block. See HDFS-3067
  if (pos > blockEnd || currentNode == null) {
    currentNode = blockSeekTo(pos);
  }
  int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
  int result = readBuffer(strategy, off, realLen, corruptedBlockMap);
  if (result >= 0) {
    pos += result;
  } else {
    // got a EOS from reader though we expect more data on it.
    throw new IOException("Unexpected EOS from the reader");
  }
  if (dfsClient.stats != null && result != -1) {
    dfsClient.stats.incrementBytesRead(result);
  }
  return result;
1. pos is the current read offset in the file; blockSeekTo(pos) first locates the block object covering pos.
2. Given that block, the client determines which DataNodes hold it and picks one to read from.
3. A client-DataNode connection is opened and a BlockReader is created.
4. readBuffer invokes the chosen ReaderStrategy's doRead to pull the wanted bytes out of the block.
Each of these steps is examined below:
1. Getting the block object
public int findBlock(long offset) {
  // create fake block of size 0 as a key
  LocatedBlock key = new LocatedBlock(
      new ExtendedBlock(), new DatanodeInfo[0], 0L, false);
  key.setStartOffset(offset);
  key.getBlock().setNumBytes(1);
  Comparator<LocatedBlock> comp =
      new Comparator<LocatedBlock>() {
        // Returns 0 iff a is inside b or b is inside a
        @Override
        public int compare(LocatedBlock a, LocatedBlock b) {
          long aBeg = a.getStartOffset();
          long bBeg = b.getStartOffset();
          long aEnd = aBeg + a.getBlockSize();
          long bEnd = bBeg + b.getBlockSize();
          if (aBeg <= bBeg && bEnd <= aEnd
              || bBeg <= aBeg && aEnd <= bEnd)
            return 0; // one of the blocks is inside the other
          if (aBeg < bBeg)
            return -1; // a's left bound is to the left of the b's
          return 1;
        }
      };
  return Collections.binarySearch(blocks, key, comp);
}
The core of locating the block is findBlock: a binary search over the cached block list, comparing each block's offset range within the file against the current offset. How the caller consumes the result is sketched below.
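When no cached block covers the offset, Collections.binarySearch returns -(insertionPoint) - 1, and the caller turns that into a fetch of additional locations from the NameNode. A hedged sketch of that caller side (the real code lives in DFSInputStream.getBlockAt; the details here are illustrative, not the exact source):

int targetBlockIdx = locatedBlocks.findBlock(offset);
if (targetBlockIdx < 0) {
  // binarySearch contract: result is -(insertionPoint) - 1 when no
  // cached block contains 'offset'
  targetBlockIdx = -targetBlockIdx - 1;      // recover the insertion point
  // ask the namenode for locations around 'offset' and splice them
  // into the cached list at the insertion point
  LocatedBlocks extra = dfsClient.getLocatedBlocks(src, offset, prefetchSize);
  // ...merge extra.getLocatedBlocks() into locatedBlocks at targetBlockIdx...
}
LocatedBlock blk = locatedBlocks.get(targetBlockIdx);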
2. Choosing a DataNode
static DatanodeInfo bestNode(DatanodeInfo nodes[],
    AbstractMap<DatanodeInfo, DatanodeInfo> deadNodes)
    throws IOException {
  if (nodes != null) {
    for (int i = 0; i < nodes.length; i++) {
      if (!deadNodes.containsKey(nodes[i])) {
        return nodes[i];
      }
    }
  }
  throw new IOException("No live nodes contain current block");
}
Picking the "best" DataNode is simple: walk the block's DataNode list in order and return the first one not marked dead. In fact, the NameNode already returns the DataNode list sorted by preference, so taking the first live entry is sensible.
private DNAddrPair chooseDataNode(LocatedBlock block)
    throws IOException {
  while (true) {
    DatanodeInfo[] nodes = block.getLocations();
    try {
      DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
      final String dnAddr =
          chosenNode.getXferAddr(dfsClient.connectToDnViaHostname());
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Connecting to datanode " + dnAddr);
      }
      InetSocketAddress targetAddr = NetUtils.createSocketAddr(dnAddr);
      return new DNAddrPair(chosenNode, targetAddr);
    } catch (IOException ie) {
      String blockInfo = block.getBlock() + " file=" + src;
      if (failures >= dfsClient.getMaxBlockAcquireFailures()) {
        throw new BlockMissingException(src, "Could not obtain block: "
            + blockInfo, block.getStartOffset());
      }
      if (nodes == null || nodes.length == 0) {
        DFSClient.LOG.info("No node available for block: " + blockInfo);
      }
      DFSClient.LOG.info("Could not obtain block " + block.getBlock()
          + " from any node: " + ie
          + ". Will get new block locations from namenode and retry...");
      try {
        // Introducing a random factor to the wait time before another retry.
        // The wait time is dependent on # of failures and a random factor.
        // At the first time of getting a BlockMissingException, the wait time
        // is a random number between 0..3000 ms. If the first retry
        // still fails, we will wait 3000 ms grace period before the 2nd retry.
        // Also at the second retry, the waiting window is expanded to 6000 ms
        // alleviating the request rate from the server. Similarly the 3rd retry
        // will wait 6000ms grace period before retry and the waiting window is
        // expanded to 9000ms.
        double waitTime = timeWindow * failures + // grace period for the last round of attempt
            timeWindow * (failures + 1) * DFSUtil.getRandom().nextDouble(); // expanding time window for each failure
        DFSClient.LOG.warn("DFS chooseDataNode: got # " + (failures + 1)
            + " IOException, will wait for " + waitTime + " msec.");
        Thread.sleep((long)waitTime);
      } catch (InterruptedException iex) {
      }
      deadNodes.clear(); //2nd option is to remove only nodes[blockId]
      openInfo();
      block = getBlockAt(block.getStartOffset(), false);
      failures++;
      continue;
    }
  }
}
This is the retry path for failed DataNode connections, with a gracefully expanding randomized backoff worth borrowing. At first glance the sleep computed seems to differ from what the comment describes, but working the formula through shows they agree: a deterministic grace period of timeWindow * failures plus a random window of timeWindow * (failures + 1), as the sketch below demonstrates.
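A standalone sketch of the wait ranges the formula produces, assuming timeWindow = 3000 ms (the value the comment implies):

public class BackoffDemo {
  public static void main(String[] args) {
    long timeWindow = 3000; // ms, per the comment in chooseDataNode
    for (int failures = 0; failures < 3; failures++) {
      double min = timeWindow * failures;             // grace period
      double max = min + timeWindow * (failures + 1); // plus random window
      System.out.printf("failure #%d: wait in [%.0f, %.0f] ms%n",
          failures + 1, min, max);
    }
    // prints [0, 3000], [3000, 9000], [6000, 15000] -- matching the
    // 0..3000, 3000 + 0..6000, 6000 + 0..9000 progression in the comment
  }
}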
3. Creating the BlockReader
// Can't local read a block under construction, see HDFS-2757
if (dfsClient.shouldTryShortCircuitRead(dnAddr) &&
    !blockUnderConstruction()) {
  return DFSClient.getLocalBlockReader(dfsClient.conf, src, block,
      blockToken, chosenNode, dfsClient.hdfsTimeout, startOffset,
      dfsClient.connectToDnViaHostname());
}
Short-circuit read is a Hadoop optimization: when the client sits on the same machine as the DataNode, it reads the block file straight off the local disk instead of streaming it from the DataNode over a socket. A sketch of the configuration involved follows.
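Enabling it in this era looked roughly like the following; the key names come from the old HDFS-2246-style implementation, so treat them as assumptions to verify against your version:

import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConf {
  static Configuration withShortCircuit() {
    Configuration conf = new Configuration();
    // Client side: opt in to short-circuit local reads.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // DataNode side (pre-domain-socket implementation): whitelist the
    // users allowed to open block files directly off the local disk.
    conf.set("dfs.block.local-path-access.user", "hdfs"); // illustrative user
    return conf;
  }
}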
// Allow retry since there is no way of knowing whether the cached socket
// is good until we actually use it.
for (int retries = 0; retries <= nCachedConnRetry && fromCache; ++retries) {
  SocketAndStreams sockAndStreams = null;
  // Don't use the cache on the last attempt - it's possible that there
  // are arbitrarily many unusable sockets in the cache, but we don't
  // want to fail the read.
  if (retries < nCachedConnRetry) {
    sockAndStreams = socketCache.get(dnAddr);
  }
  Socket sock;
  if (sockAndStreams == null) {
    fromCache = false;
    sock = dfsClient.socketFactory.createSocket();
    // TCP_NODELAY is crucial here because of bad interactions between
    // Nagle's Algorithm and Delayed ACKs. With connection keepalive
    // between the client and DN, the conversation looks like:
    //   1. Client -> DN: Read block X
    //   2. DN -> Client: data for block X
    //   3. Client -> DN: Status OK (successful read)
    //   4. Client -> DN: Read block Y
    // The fact that step #3 and #4 are both in the client->DN direction
    // triggers Nagling. If the DN is using delayed ACKs, this results
    // in a delay of 40ms or more.
    //
    // TCP_NODELAY disables nagling and thus avoids this performance
    // disaster.
    sock.setTcpNoDelay(true);
    NetUtils.connect(sock, dnAddr,
        dfsClient.getRandomLocalInterfaceAddr(),
        dfsClient.getConf().socketTimeout);
    sock.setSoTimeout(dfsClient.getConf().socketTimeout);
  } else {
    sock = sockAndStreams.sock;
  }
  try {
    // The OP_READ_BLOCK request is sent as we make the BlockReader
    BlockReader reader =
        BlockReaderFactory.newBlockReader(dfsClient.getConf(),
            sock, file, block,
            blockToken,
            startOffset, len,
            bufferSize, verifyChecksum,
            clientName,
            dfsClient.getDataEncryptionKey(),
            sockAndStreams == null ? null : sockAndStreams.ioStreams);
    return reader;
  } catch (IOException ex) {
    // Our socket is no good.
    DFSClient.LOG.debug("Error making BlockReader. Closing stale " + sock, ex);
    if (sockAndStreams != null) {
      sockAndStreams.close();
    } else {
      sock.close();
    }
    err = ex;
  }
}
The socket is set up first, with TCP_NODELAY enabled. Because the client-DataNode conversation is strictly sequenced, leaving Nagle's algorithm on (which interacts badly with delayed ACKs, as the comment explains) would make the client very slow.
BlockReaderFactory.newBlockReader then builds the BlockReader object, through which the block's contents are subsequently read. Driving the reader is ordinary stream-style reading, sketched below.
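A hedged sketch of how such a reader is drained (the real caller is readBuffer/doRead inside DFSInputStream; the helper below is illustrative):

import java.io.IOException;
import java.io.OutputStream;

public class DrainBlock {
  // Pull the requested range of the block through the reader; read()
  // returns -1 once the requested bytes are exhausted.
  static long drain(BlockReader reader, OutputStream out) throws IOException {
    byte[] buf = new byte[4096];
    long total = 0;
    int n;
    while ((n = reader.read(buf, 0, buf.length)) != -1) {
      out.write(buf, 0, n);
      total += n;
    }
    return total;
  }
}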
4. Reading the block through the BlockReader
public synchronized int read(byte[] buf, int off, int len)
    throws IOException {
  if (curDataSlice == null ||
      (curDataSlice.remaining() == 0 && bytesNeededToFinish > 0)) {
    readNextPacket();
  }
  if (curDataSlice.remaining() == 0) {
    // we're at EOF now
    return -1;
  }
  int nRead = Math.min(curDataSlice.remaining(), len);
  curDataSlice.get(buf, off, nRead);
  return nRead;
}
The heavy lifting happens in readNextPacket:

// Read packet headers.
packetReceiver.receiveNextPacket(in);

PacketHeader curHeader = packetReceiver.getHeader();
curDataSlice = packetReceiver.getDataSlice();
assert curDataSlice.capacity() == curHeader.getDataLen();

if (LOG.isTraceEnabled()) {
  LOG.trace("DFSClient readNextPacket got header " + curHeader);
}

// Sanity check the lengths
if (!curHeader.sanityCheck(lastSeqNo)) {
  throw new IOException("BlockReader: error in packet header " + curHeader);
}

if (curHeader.getDataLen() > 0) {
  int chunks = 1 + (curHeader.getDataLen() - 1) / bytesPerChecksum;
  int checksumsLen = chunks * checksumSize;
  assert packetReceiver.getChecksumSlice().capacity() == checksumsLen :
      "checksum slice capacity=" + packetReceiver.getChecksumSlice().capacity() +
      " checksumsLen=" + checksumsLen;
  lastSeqNo = curHeader.getSeqno();
  if (verifyChecksum && curDataSlice.remaining() > 0) {
    // N.B.: the checksum error offset reported here is actually
    // relative to the start of the block, not the start of the file.
    // This is slightly misleading, but preserves the behavior from
    // the older BlockReader.
    checksum.verifyChunkedSums(curDataSlice,
        packetReceiver.getChecksumSlice(),
        filename, curHeader.getOffsetInBlock());
  }
  bytesNeededToFinish -= curHeader.getDataLen();
}

// First packet will include some data prior to the first byte
// the user requested. Skip it.
if (curHeader.getOffsetInBlock() < startOffset) {
  int newPos = (int) (startOffset - curHeader.getOffsetInBlock());
  curDataSlice.position(newPos);
}

// If we've now satisfied the whole client read, read one last packet
// header, which should be empty
if (bytesNeededToFinish <= 0) {
  readTrailingEmptyPacket();
  if (verifyChecksum) {
    sendReadResult(Status.CHECKSUM_OK);
  } else {
    sendReadResult(Status.SUCCESS);
  }
}
This is RemoteBlockReader2's read in action, and the logic is not trivial. The DataNode streams the block as a sequence of packets; for each packet the client validates the header, verifies the payload against its checksums, and exposes the data through curDataSlice. Note that, per the code above, the client does not ack every packet: only after the whole requested range has been consumed (and a trailing empty packet read) does it send one status back to the DataNode, CHECKSUM_OK or SUCCESS depending on whether checksums were verified. The checksum bookkeeping is worked through below.
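To make that bookkeeping concrete, a worked example assuming the common defaults of bytesPerChecksum = 512 and 4-byte CRC32 checksums: a 64 KB packet carries 128 chunks and thus 512 bytes of checksums, exactly what the assert above checks:

public class ChecksumLayout {
  public static void main(String[] args) {
    int dataLen = 64 * 1024;      // payload bytes in one packet
    int bytesPerChecksum = 512;   // common io.bytes.per.checksum default
    int checksumSize = 4;         // a CRC32 checksum is 4 bytes
    int chunks = 1 + (dataLen - 1) / bytesPerChecksum; // ceil(dataLen/512) = 128
    int checksumsLen = chunks * checksumSize;          // 128 * 4 = 512 bytes
    System.out.println(chunks + " chunks, " + checksumsLen + " checksum bytes");
  }
}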
One thing looked odd on a first read: if the caller asks for more bytes than remain in the current block, would the excess be silently dropped? On closer inspection, no: readWithStrategy clamps each read to the current block boundary (realLen = min(len, blockEnd - pos + 1)), so the caller simply gets a short read and the next call moves on to the next block. Callers that need an exact byte count loop in the usual InputStream fashion, as sketched below.
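A minimal sketch of such a loop (essentially what helpers like IOUtils.readFully do):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadUtil {
  // Loop until exactly 'len' bytes have arrived; each iteration may stop
  // early, e.g. at an HDFS block boundary.
  static void readFully(InputStream in, byte[] buf, int off, int len)
      throws IOException {
    int done = 0;
    while (done < len) {
      int n = in.read(buf, off + done, len - done);
      if (n < 0) {
        throw new EOFException("stream ended after " + done + " bytes");
      }
      done += n;
    }
  }
}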
To sum up:
1. The client uses a strict read-then-ack sequencing without a dedicated receive thread, which guarantees read correctness at some cost in efficiency.
2. The socket path uses NIO for efficiency, probably the biggest improvement over earlier versions.
3. Short-circuit read is said not to exist in the 0.92 line, but the CDH4 code clearly contains it; I have not yet figured out how to test whether it actually works.