Table of Contents
Saving HDFS metadata in memory
Saving HDFS metadata to disk
Edit log execution flow
Checkpoint trigger conditions
Preface:
Think with questions in mind and find the answers in the source code.
Question one: What data structure does the NameNode use for its in-memory directory tree? Is it the same as ZooKeeper's?
Question two: Is writing the NameNode's metadata to disk particularly slow? What mechanism speeds this process up?
Question three: How do the NameNode and JournalNodes communicate? How do these separate processes talk to each other?
Question four: When are fsimage and editlog merged on the NameNode? When does the standby NameNode merge fsimage and editlog, and when does the NameNode fetch the fsimage from the standby NameNode?
Question five: Why is the standby NameNode, rather than the active NameNode, used to merge fsimage and editlog?
Saving HDFS metadata in memory
- Let's see how metadata is kept in memory by following the code path of creating a directory. The source code examined here is from hadoop-2.7.x.
import java.io.IOException;
import java.net.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException {
    Configuration configuration = new Configuration();
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.101.51:9000"), configuration, "root");
    fileSystem.mkdirs(new Path("/user/mydata1"));
}
- After the code runs successfully, the new directory is visible in HDFS.
- The rough source-code flow for creating the directory tree is shown in the figure.
Below is how, during mkdirs, the new entry is added to List<INode> children in the directory tree.
- First, enter fileSystem.mkdirs(Path f) in FileSystem.java.
/**
* Call {@link #mkdirs(Path, FsPermission)} with default permission.
*/
public boolean mkdirs(Path f) throws IOException {
return mkdirs(f, FsPermission.getDirDefault());
}
- Stepping into mkdirs, we find it is an abstract method.
/**
* Make the given file and all non-existent parents into
* directories. Has the semantics of Unix 'mkdir -p'.
* Existence of the directory hierarchy is not an error.
* @param f path to create
* @param permission to apply to f
*/
public abstract boolean mkdirs(Path f, FsPermission permission
) throws IOException;
- The implementation is in DistributedFileSystem.java, which calls mkdirsInternal(f, permission, true).
public boolean mkdirs(Path f, FsPermission permission) throws IOException {
return mkdirsInternal(f, permission, true);
}
- Finally, through DFSClient.java; namenode here is of type ClientProtocol, which means the client is making a remote call to the server side.
return namenode.mkdirs(src, absPermission, createParent);
- This lands in the mkdirs(String src, FsPermission masked, boolean createParent) method of NameNodeRpcServer.java,
return namesystem.mkdirs(src,
new PermissionStatus(getRemoteUser().getShortUserName(),
null, masked), createParent);
- Then into FSNamesystem.java:
checkOperation(OperationCategory.WRITE);
//todo directories cannot be created while in safe mode
checkNameNodeSafeMode("Cannot create directory " + src);
auditStat = FSDirMkdirOp.mkdirs(this, src, permissions, createParent);
- Next, into FSDirMkdirOp.java.
- In this method, fsn.getFSDirectory() obtains the directory tree. It then checks whether the last existing INode on the target path (lastINode) is a file and throws an exception if so; it then works out which path components do not yet exist and, depending on their number (length), ultimately calls createChildrenDirectories(fsd, existing, ancestors, addImplicitUwx(permissions, permissions)) to create the directories.
static HdfsFileStatus mkdirs(FSNamesystem fsn, String src,
PermissionStatus permissions, boolean createParent) throws IOException {
FSDirectory fsd = fsn.getFSDirectory();
if(NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("DIR* NameSystem.mkdirs: " + src);
}
if (!DFSUtil.isValidName(src)) {
throw new InvalidPathException(src);
}
FSPermissionChecker pc = fsd.getPermissionChecker();
byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
fsd.writeLock();
try {
src = fsd.resolvePath(pc, src, pathComponents);
INodesInPath iip = fsd.getINodesInPath4Write(src);
if (fsd.isPermissionEnabled()) {
fsd.checkTraverse(pc, iip);
}
/**
* src /user/warehouse/hive/gmall.db/data
* lastINode /user/warehouse/hive/gmall.db
* check whether the last, already-existing INode on the path is a file
*/
final INode lastINode = iip.getLastINode();
if (lastINode != null && lastINode.isFile()) {
throw new FileAlreadyExistsException("Path is not a directory: " + src);
}
INodesInPath existing = lastINode != null ? iip : iip.getExistingINodes();
if (lastINode == null) {
if (fsd.isPermissionEnabled()) {
fsd.checkAncestorAccess(pc, iip, FsAction.WRITE);
}
if (!createParent) {
fsd.verifyParentDir(iip, src);
}
// validate that we have enough inodes. This is, at best, a
// heuristic because the mkdirs() operation might need to
// create multiple inodes.
fsn.checkFsObjectLimit();
//todo the path components not yet present in iip are the directories that need to be created
List<String> nonExisting = iip.getPath(existing.length(),
iip.length() - existing.length());
int length = nonExisting.size();
if (length > 1) {
List<String> ancestors = nonExisting.subList(0, length - 1);
// Ensure that the user can traversal the path by adding implicit
// u+wx permission to all ancestor directories
existing = createChildrenDirectories(fsd, existing, ancestors,
addImplicitUwx(permissions, permissions));
if (existing == null) {
throw new IOException("Failed to create directory: " + src);
}
}
if ((existing = createChildrenDirectories(fsd, existing,
nonExisting.subList(length - 1, length), permissions)) == null) {
throw new IOException("Failed to create directory: " + src);
}
}
return fsd.getAuditFileInfo(existing);
} finally {
fsd.writeUnlock();
}
}
- After the length check above, both branches end up in createChildrenDirectories and finally in createSingleDirectory. In this method, unprotectedMkdir writes the metadata into memory and fsd.getEditLog().logMkDir writes it to disk.
private static INodesInPath createSingleDirectory(FSDirectory fsd,
INodesInPath existing, String localName, PermissionStatus perm)
throws IOException {
assert fsd.hasWriteLock();
//todo write the metadata into memory
existing = unprotectedMkdir(fsd, fsd.allocateNewInodeId(), existing,
localName.getBytes(Charsets.UTF_8), perm, null, now());
if (existing == null) {
return null;
}
final INode newNode = existing.getLastINode();
// Directory creation also count towards FilesCreated
// to match count of FilesDeleted metric.
NameNode.getNameNodeMetrics().incrFilesCreated();
String cur = existing.getPath();
//todo write the metadata to disk (the edit log)
fsd.getEditLog().logMkDir(cur, newNode);
if (NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("mkdirs: created directory " + cur);
}
return existing;
}
- Finally, FSDirectory.java adds the child node.
/**
* Add a child to the end of the path specified by INodesInPath.
* @return an INodesInPath instance containing the new INode
*/
@VisibleForTesting
public INodesInPath addLastINode(INodesInPath existing, INode inode,
boolean checkQuota) throws QuotaExceededException {
assert existing.getLastINode() != null &&
existing.getLastINode().isDirectory();
final int pos = existing.length();
// Disallow creation of /.reserved. This may be created when loading
// editlog/fsimage during upgrade since /.reserved was a valid name in older
// release. This may also be called when a user tries to create a file
// or directory /.reserved.
if (pos == 1 && existing.getINode(0) == rootDir && isReservedName(inode)) {
throw new HadoopIllegalArgumentException(
"File name \"" + inode.getLocalName() + "\" is reserved and cannot "
+ "be created. If this is during upgrade change the name of the "
+ "existing file or directory to another name before upgrading "
+ "to the new release.");
}
//todo get the parent directory
final INodeDirectory parent = existing.getINode(pos - 1).asDirectory();
// The filesystem limits are not really quotas, so this check may appear
// odd. It's because a rename operation deletes the src, tries to add
// to the dest, if that fails, re-adds the src from whence it came.
// The rename code disables the quota when it's restoring to the
// original location because a quota violation would cause the the item
// to go "poof". The fs limits must be bypassed for the same reason.
if (checkQuota) {
final String parentPath = existing.getPath();
verifyMaxComponentLength(inode.getLocalNameBytes(), parentPath);
verifyMaxDirItems(parent, parentPath);
}
// always verify inode name
verifyINodeName(inode.getLocalNameBytes());
final QuotaCounts counts = inode.computeQuotaUsage(getBlockStoragePolicySuite());
updateCount(existing, pos, counts, checkQuota);
boolean isRename = (inode.getParent() != null);
boolean added;
try {
//todo add the child node
added = parent.addChild(inode, true, existing.getLatestSnapshotId());
} catch (QuotaExceededException e) {
updateCountNoQuotaCheck(existing, pos, counts.negation());
throw e;
}
if (!added) {
updateCountNoQuotaCheck(existing, pos, counts.negation());
return null;
} else {
if (!isRename) {
AclStorage.copyINodeDefaultAcl(inode);
}
addToInodeMap(inode);
}
return INodesInPath.append(existing, inode, inode.getLocalNameBytes());
}
- Finally, INodeDirectory.java inserts the new INode into its children list:
children.add(-insertionPoint - 1, node);
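The index -insertionPoint - 1 follows the binary-search convention: children is kept sorted by name, and a search for a name that is not present returns -(insertionIndex) - 1, so negating and subtracting one recovers the position that keeps the list sorted. A minimal standalone sketch of the same idiom (using java.util.Collections.binarySearch here purely for illustration, not Hadoop's internal child search):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedInsertSketch {
    public static void main(String[] args) {
        List<String> children = new ArrayList<>(List.of("data", "hive", "user"));
        String newChild = "gmall.db";
        // For a missing element, binarySearch returns -(insertionIndex) - 1.
        int insertionPoint = Collections.binarySearch(children, newChild);
        if (insertionPoint < 0) {
            // Same recovery as children.add(-insertionPoint - 1, node) above.
            children.add(-insertionPoint - 1, newChild);
        }
        System.out.println(children); // [data, gmall.db, hive, user]
    }
}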
Summary: FSDirectory.java manages the in-memory directory tree. Tree nodes are instances of the abstract class INode.java, which has two implementation classes, INodeDirectory.java and INodeFile.java; INodeDirectory holds private List<INode> children = null;, so each child is either a file or a directory.
So the answer to question one is: the in-memory directory tree keeps each level's entries as INodes in a List<INode> children held by the parent directory. How ZooKeeper stores its tree will be looked at when analyzing the ZooKeeper source code.
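As a rough illustration of this structure, here is a simplified sketch (abbreviated, made-up class names; not the real Hadoop classes):

import java.util.ArrayList;
import java.util.List;

// Simplified model of the NameNode's in-memory directory tree.
abstract class INodeSketch {
    final String name;
    INodeSketch(String name) { this.name = name; }
}

class INodeFileSketch extends INodeSketch {
    INodeFileSketch(String name) { super(name); }
}

class INodeDirectorySketch extends INodeSketch {
    // Mirrors INodeDirectory's "private List<INode> children".
    private final List<INodeSketch> children = new ArrayList<>();
    INodeDirectorySketch(String name) { super(name); }
    void addChild(INodeSketch child) { children.add(child); }
}

class TreeDemo {
    public static void main(String[] args) {
        INodeDirectorySketch root = new INodeDirectorySketch("/");
        INodeDirectorySketch user = new INodeDirectorySketch("user");
        root.addChild(user);
        user.addChild(new INodeDirectorySketch("mydata1")); // models /user/mydata1
    }
}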
Saving HDFS metadata to disk
- Still following the mkdirs() source, we trace how metadata is persisted to disk.
Again, a rough flow diagram first.
- As seen above, fsd.getEditLog().logMkDir writes the metadata to disk. This is implemented in the logEdit method of FSEditLog.java.
- The method takes a lock as soon as it is entered. Inside the synchronized (this) {} block there is a txid, initially 0, which is incremented in beginTransaction(). editLogStream.write(op) is an abstract method with two implementations, QuorumOutputStream and EditLogFileOutputStream, which write the metadata to the JournalNodes and to the NameNode's local disk respectively; both ultimately use EditsDoubleBuffer.java to write the operation into bufCurrent.
- So the returned editLogStream fans out to two streams (a simplified fan-out sketch follows the method below).
/**
* Write an operation to the edit log. Do not sync to persistent
* store yet.
*/
void logEdit(final FSEditLogOp op) {
synchronized (this) {
assert isOpenForWrite() :
"bad state: " + state;
// wait if an automatic sync is scheduled
//todo no waiting is needed at the very beginning
waitIfAutoSyncScheduled();
long start = beginTransaction();
//todo attach the current transaction id to the op
op.setTransactionId(txid);
try {
//todo write the metadata into the in-memory buffer and to the JournalNodes
//todo editLogStream fans out to two streams:
//todo QuorumJournalManager -> QuorumOutputStream (JournalNodes)
//todo FileJournalManager -> EditLogFileOutputStream (local disk)
editLogStream.write(op);
} catch (IOException ex) {
// All journals failed, it is handled in logSync.
} finally {
op.reset();
}
endTransaction(start);
// check if it is time to schedule an automatic sync
if (!shouldForceSync()) {
return;
}
isAutoSyncScheduled = true;
}
// sync buffered edit log entries to persistent store
//todo sync the buffered edit log to persistent storage
logSync();
}
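As a rough illustration of that fan-out, here is a simplified sketch (made-up names; not Hadoop's JournalSet) of writing one edit op to several journal streams:

import java.io.IOException;
import java.util.List;

// Simplified fan-out of one edit op to multiple journal streams
// (local edit log file plus JournalNodes). Illustrative names only.
interface EditStreamSketch {
    void write(String op) throws IOException;   // buffer the op (e.g. into bufCurrent)
    void flush() throws IOException;            // push buffered ops to persistent storage
}

class JournalSetSketch implements EditStreamSketch {
    private final List<EditStreamSketch> journals;

    JournalSetSketch(List<EditStreamSketch> journals) { this.journals = journals; }

    @Override
    public void write(String op) throws IOException {
        // Every journal (local file stream, quorum stream) receives the same op.
        for (EditStreamSketch j : journals) {
            j.write(op);
        }
    }

    @Override
    public void flush() throws IOException {
        for (EditStreamSketch j : journals) {
            j.flush();
        }
    }
}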
- In logSync(), the two in-memory buffers are swapped: one buffer's contents are written to disk while the other keeps receiving data from clients.
- myTransactionId is a ThreadLocal, which is what keeps each thread's view of txid (one per metadata operation) thread safe.
- The first thread entering this method also takes the lock. On its first pass it does not enter the while (mytxid > synctxid && isSyncRunning) loop; it sets syncStart = txid and isSyncRunning = true. By this point the locked logEdit path above may already have written all pending metadata operations into bufCurrent. Now editLogStream.setReadyToFlush() swaps the buffers, and logStream.flush() writes them to disk.
public void logSync() {
long syncStart = 0;
// Fetch the transactionId of this thread.
long mytxid = myTransactionId.get().txid;
boolean sync = false;
try {
EditLogOutputStream logStream = null;
synchronized (this) {
try {
printStatistics(false);
// if somebody is already syncing, then wait
while (mytxid > synctxid && isSyncRunning) {
try {
wait(1000);
} catch (InterruptedException ie) {
}
}
//
// If this transaction was already flushed, then nothing to do
//
if (mytxid <= synctxid) {
numTransactionsBatchedInSync++;
if (metrics != null) {
// Metrics is non-null only when used inside name node
metrics.incrTransactionsBatchedInSync();
}
return;
}
// now, this thread will do the sync
syncStart = txid;
isSyncRunning = true;
sync = true;
// swap buffers
try {
if (journalSet.isEmpty()) {
throw new IOException("No journals available to flush");
}
//todo swap the buffers
editLogStream.setReadyToFlush();
} catch (IOException e) {
final String msg =
"Could not sync enough journals to persistent storage " +
"due to " + e.getMessage() + ". " +
"Unsynced transactions: " + (txid - synctxid);
LOG.fatal(msg, new Exception());
synchronized(journalSetLock) {
IOUtils.cleanup(LOG, journalSet);
}
terminate(1, msg);
}
} finally {
// Prevent RuntimeException from blocking other log edit write
doneWithAutoSyncScheduling();
}
//editLogStream may become null,
//so store a local variable for flush.
logStream = editLogStream;
}
// do the sync
long start = monotonicNow();
try {
if (logStream != null) {
//todo flush to disk
logStream.flush();
}
} catch (IOException ex) {
synchronized (this) {
final String msg =
"Could not sync enough journals to persistent storage. "
+ "Unsynced transactions: " + (txid - synctxid);
LOG.fatal(msg, new Exception());
synchronized(journalSetLock) {
IOUtils.cleanup(LOG, journalSet);
}
terminate(1, msg);
}
}
long elapsed = monotonicNow() - start;
if (metrics != null) { // Metrics non-null only when used inside name node
metrics.addSync(elapsed);
}
} finally {
// Prevent RuntimeException from blocking other log edit sync
synchronized (this) {
if (sync) {
synctxid = syncStart;
isSyncRunning = false;
}
this.notifyAll();
}
}
}
- How does editLogStream.setReadyToFlush() swap the buffers? The source below gives the answer.
public void setReadyToFlush() {
//todo swap the buffers
assert isFlushed() : "previous data not flushed yet";
TxnBuffer tmp = bufReady;
bufReady = bufCurrent;
bufCurrent = tmp;
}
This also answers question two: the double-buffer scheme prevents writing NameNode metadata to disk from saturating I/O and slowing the write path down. A simplified sketch of the pattern follows.
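Here is a minimal standalone sketch of the double-buffer idea (simplified, made-up class name; not Hadoop's EditsDoubleBuffer):

import java.util.ArrayList;
import java.util.List;

// Writers append to bufCurrent while a sync thread flushes bufReady;
// setReadyToFlush() swaps the two. Simplified for illustration.
class DoubleBufferSketch {
    private List<String> bufCurrent = new ArrayList<>(); // receives new edit ops
    private List<String> bufReady   = new ArrayList<>(); // being flushed to disk

    synchronized void write(String op) {
        bufCurrent.add(op);                // fast in-memory append, lock held only briefly
    }

    synchronized void setReadyToFlush() {
        List<String> tmp = bufReady;       // same swap as the Hadoop method above
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    void flush() {
        // Runs outside the write lock, so slow disk I/O does not block new writes,
        // which keep landing in bufCurrent. In the real code the isSyncRunning flag
        // ensures no second swap happens while a flush is still in progress.
        for (String op : bufReady) {
            // write op to the edit log file / JournalNodes here
        }
        bufReady.clear();
    }
}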
Question three: how do the NameNode and JournalNodes communicate? In editLogStream.write(op) above we only looked at the EditLogFileOutputStream class, i.e. writing the NameNode's local metadata; now look at the QuorumOutputStream class to see how the metadata is written to the JournalNodes.
loggers.sendEdits() leads into AsyncLoggerSet.java; the underlying AsyncLogger is an interface, and its implementation class is IPCLoggerChannel.
@Override
protected void flushAndSync(boolean durable) throws IOException {
int numReadyBytes = buf.countReadyBytes();
if (numReadyBytes > 0) {
int numReadyTxns = buf.countReadyTxns();
long firstTxToFlush = buf.getFirstReadyTxId();
assert numReadyTxns > 0;
// Copy from our double-buffer into a new byte array. This is for
// two reasons:
// 1) The IPC code has no way of specifying to send only a slice of
// a larger array.
// 2) because the calls to the underlying nodes are asynchronous, we
// need a defensive copy to avoid accidentally mutating the buffer
// before it is sent.
DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
buf.flushTo(bufToSend);
assert bufToSend.getLength() == numReadyBytes;
byte[] data = bufToSend.getData();
assert data.length == bufToSend.getLength();
//todo send the data to the JournalNodes
QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
segmentTxId, firstTxToFlush,
numReadyTxns, data);
//todo this call blocks, waiting for the JournalNode cluster (a write quorum) to acknowledge
loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
// Since we successfully wrote this batch, let the loggers know. Any future
// RPCs will thus let the loggers know of the most recent transaction, even
// if a logger has fallen behind.
loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
}
}
In IPCLoggerChannel we find the sendEdits() method, which calls getProxy().journal(). So for question three: the NameNode and JournalNodes communicate via RPC, and the JournalNodes persist the metadata to their own disks. A sketch of the quorum-wait pattern follows the snippet below.
getProxy().journal(createReqInfo(),
segmentTxId, firstTxnId, numTxns, data);
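To illustrate the quorum idea behind loggers.sendEdits(...) followed by waitForWriteQuorum(...), here is a simplified sketch (made-up names and types; not Hadoop's QuorumCall API): the same batch of edits is sent to every JournalNode asynchronously, and the NameNode only proceeds once a majority has acknowledged.

import java.util.List;
import java.util.concurrent.*;

// Send the same batch of edits to every "journal node" asynchronously and
// block until a majority has acknowledged. Illustrative sketch only.
class QuorumWriteSketch {
    interface JournalRpc {
        // stand-in for the RPC made via getProxy().journal(...)
        void journal(long firstTxId, byte[] edits) throws Exception;
    }

    static void sendEditsToQuorum(List<JournalRpc> journals, long firstTxId,
                                  byte[] edits, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(journals.size());
        CountDownLatch majority = new CountDownLatch(journals.size() / 2 + 1);
        try {
            for (JournalRpc j : journals) {
                pool.submit(() -> {
                    try {
                        j.journal(firstTxId, edits); // one RPC per JournalNode
                        majority.countDown();        // count successful acks
                    } catch (Exception ignored) {
                        // a minority of failures is tolerated
                    }
                });
            }
            // Block until a majority of JournalNodes have persisted the edits.
            if (!majority.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new Exception("timed out waiting for a write quorum");
            }
        } finally {
            pool.shutdown();
        }
    }
}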
Edit log execution flow
Checkpoint trigger conditions
To be continued... questions four and five will be analyzed later.
EditLogTailer is a background thread; once started, it periodically reads metadata edit logs from the JournalNode cluster and applies them to its own metadata (memory + disk).
EditLogTailerThread()
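A minimal sketch of that periodic pattern (illustrative class name, interval, and method bodies; not the actual EditLogTailer code):

// A tailer thread that periodically pulls new edits and replays them.
class EditLogTailerSketch extends Thread {
    private final long sleepTimeMs;
    private volatile boolean shouldRun = true;

    EditLogTailerSketch(long sleepTimeMs) { this.sleepTimeMs = sleepTimeMs; }

    @Override
    public void run() {
        while (shouldRun) {
            try {
                // 1) fetch edit segments newer than the last applied txid from the JournalNodes
                // 2) replay them into the in-memory namespace (and local storage)
                Thread.sleep(sleepTimeMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void stopTailing() { shouldRun = false; }
}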
The standby NameNode periodically checkpoints its in-memory data into an fsimage file, which then replaces the fsimage file on the active NameNode.
StandbyCheckpointer: the checkpoint conditions are 1. a certain number of transactions has accumulated, or 2. a certain amount of time has elapsed.
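A rough sketch of how those two conditions can be checked (illustrative thresholds in the spirit of the dfs.namenode.checkpoint.txns and dfs.namenode.checkpoint.period settings; not the actual StandbyCheckpointer logic):

// Decide whether the standby should trigger a checkpoint.
class CheckpointTriggerSketch {
    private final long checkpointTxns;      // e.g. 1,000,000 uncheckpointed transactions
    private final long checkpointPeriodMs;  // e.g. 3,600,000 ms (one hour)

    CheckpointTriggerSketch(long checkpointTxns, long checkpointPeriodMs) {
        this.checkpointTxns = checkpointTxns;
        this.checkpointPeriodMs = checkpointPeriodMs;
    }

    boolean shouldCheckpoint(long uncheckpointedTxns, long msSinceLastCheckpoint) {
        // Condition 1: too many transactions since the last checkpoint.
        if (uncheckpointedTxns >= checkpointTxns) {
            return true;
        }
        // Condition 2: too much time since the last checkpoint.
        return msSinceLastCheckpoint >= checkpointPeriodMs;
    }
}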