Contents

Saving HDFS metadata in memory

Saving HDFS metadata to disk

Editlog execution flow

Checkpoint trigger conditions


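Preface:

Read the source with questions in mind and let the code answer them.

Question 1: What data structure does the NameNode use for its in-memory directory tree, and is it the same as ZooKeeper's?

Question 2: Is writing NameNode metadata to disk especially slow? What mechanism speeds it up?

Question 3: How do the NameNode and the JournalNodes, which are separate processes, communicate?

Question 4: When are fsimage and editlog merged on the NameNode, when does the Standby NameNode merge them, and when does the NameNode fetch the fsimage from the Standby NameNode?

Question 5: Why is the merge of fsimage and editlog done by the Standby NameNode rather than by the Active NameNode?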

Saving HDFS metadata in memory

  • A small program that creates a directory shows how metadata ends up in memory. The source discussed here is hadoop-2.7.x
public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException {
        Configuration configuration=new Configuration();
        FileSystem fileSystem= FileSystem.get( new URI( "hdfs://192.168.101.51:9000" ), configuration,"root" );

        fileSystem.mkdirs(new Path("/user/mydata1"));

    }
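  • After the code runs successfully, check HDFS

[Figure: HDFS listing after running the code]

  • The rough source-level flow for creating the directory tree is shown in the figure

[Figure: source-code flow for creating the directory tree]

The following traces how mkdirs adds an entry to the directory tree's List<INode> children.

  • Start in FileSystem.java with fileSystem.mkdirs(Path f)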
/**
   * Call {@link #mkdirs(Path, FsPermission)} with default permission.
   */
  public boolean mkdirs(Path f) throws IOException {
    return mkdirs(f, FsPermission.getDirDefault());
  }
  • Step into mkdirs(Path, FsPermission), which turns out to be an abstract method
/**
   * Make the given file and all non-existent parents into
   * directories. Has the semantics of Unix 'mkdir -p'.
   * Existence of the directory hierarchy is not an error.
   * @param f path to create
   * @param permission to apply to f
   */
  public abstract boolean mkdirs(Path f, FsPermission permission
      ) throws IOException;
  • The implementation is in DistributedFileSystem.java, which calls mkdirsInternal(f, permission, true)
public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    return mkdirsInternal(f, permission, true);
  }
  • This ends up in DFSClient.java, where namenode is of type ClientProtocol, which means the client is making a remote call to the server
return namenode.mkdirs(src, absPermission, createParent);
  • On the server side, find the mkdirs(String src, FsPermission masked, boolean createParent) method in NameNodeRpcServer.java
return namesystem.mkdirs(src,
        new PermissionStatus(getRemoteUser().getShortUserName(),
            null, masked), createParent);
  • This leads into FSNamesystem.java
checkOperation(OperationCategory.WRITE);
      //todo in safe mode, directories cannot be created
      checkNameNodeSafeMode("Cannot create directory " + src);

      auditStat = FSDirMkdirOp.mkdirs(this, src, permissions, createParent);
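  • Then step into FSDirMkdirOp.java
  • In this method, fsn.getFSDirectory() obtains the directory tree. It then checks whether the last node (lastINode) of the target path is a file and throws an error if so. Next it computes the path components that do not exist yet and checks how many there are (length). Both branches end by calling createChildrenDirectories, for example createChildrenDirectories(fsd, existing, ancestors, addImplicitUwx(permissions, permissions)), to create the directories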
static HdfsFileStatus mkdirs(FSNamesystem fsn, String src,
      PermissionStatus permissions, boolean createParent) throws IOException {
    FSDirectory fsd = fsn.getFSDirectory();
    if(NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("DIR* NameSystem.mkdirs: " + src);
    }
    if (!DFSUtil.isValidName(src)) {
      throw new InvalidPathException(src);
    }
    FSPermissionChecker pc = fsd.getPermissionChecker();
    byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
    fsd.writeLock();
    try {
      src = fsd.resolvePath(pc, src, pathComponents);
      INodesInPath iip = fsd.getINodesInPath4Write(src);
      if (fsd.isPermissionEnabled()) {
        fsd.checkTraverse(pc, iip);
      }
      /**
       * Example: src = /user/warehouse/hive/gmall.db/data
       * lastINode is the inode of the last component of src, or null if
       * that component does not exist yet.
       * If it already exists and is a file, fail.
       */
      final INode lastINode = iip.getLastINode();
      if (lastINode != null && lastINode.isFile()) {
        throw new FileAlreadyExistsException("Path is not a directory: " + src);
      }

      INodesInPath existing = lastINode != null ? iip : iip.getExistingINodes();
      if (lastINode == null) {
        if (fsd.isPermissionEnabled()) {
          fsd.checkAncestorAccess(pc, iip, FsAction.WRITE);
        }

        if (!createParent) {
          fsd.verifyParentDir(iip, src);
        }

        // validate that we have enough inodes. This is, at best, a
        // heuristic because the mkdirs() operation might need to
        // create multiple inodes.
        fsn.checkFsObjectLimit();
        //todo work out which components of iip do not exist yet and need to be created
        List<String> nonExisting = iip.getPath(existing.length(),
            iip.length() - existing.length());
        int length = nonExisting.size();
        if (length > 1) {
          List<String> ancestors = nonExisting.subList(0, length - 1);
          // Ensure that the user can traversal the path by adding implicit
          // u+wx permission to all ancestor directories
          existing = createChildrenDirectories(fsd, existing, ancestors,
              addImplicitUwx(permissions, permissions));
          if (existing == null) {
            throw new IOException("Failed to create directory: " + src);
          }
        }

        if ((existing = createChildrenDirectories(fsd, existing,
            nonExisting.subList(length - 1, length), permissions)) == null) {
          throw new IOException("Failed to create directory: " + src);
        }
      }
      return fsd.getAuditFileInfo(existing);
    } finally {
      fsd.writeUnlock();
    }
  }
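
To make the ancestors / last-component split in mkdirs easier to follow, here is a minimal sketch using plain strings instead of INodesInPath. The class MissingComponents and its methods are hypothetical helpers written for illustration; they are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hadoop code: models how mkdirs separates the
// components that still have to be created into "ancestors" and "last".
class MissingComponents {
    /** Splits "/a/b/c" into [a, b, c], skipping empty components. */
    static List<String> components(String path) {
        List<String> out = new ArrayList<>();
        for (String c : path.split("/")) {
            if (!c.isEmpty()) {
                out.add(c);
            }
        }
        return out;
    }

    /** Components of target that lie beyond the already existing prefix. */
    static List<String> nonExisting(String existingPrefix, String target) {
        List<String> have = components(existingPrefix);
        List<String> want = components(target);
        if (want.size() < have.size()
                || !want.subList(0, have.size()).equals(have)) {
            throw new IllegalArgumentException(
                existingPrefix + " is not a prefix of " + target);
        }
        return new ArrayList<>(want.subList(have.size(), want.size()));
    }

    public static void main(String[] args) {
        List<String> missing = nonExisting("/user", "/user/warehouse/hive/data");
        int length = missing.size();
        // Same slicing as in FSDirMkdirOp.mkdirs: every component except the
        // last is an ancestor; the last one gets the caller's permissions.
        List<String> ancestors = missing.subList(0, length - 1);
        List<String> last = missing.subList(length - 1, length);
        System.out.println("ancestors = " + ancestors);
        System.out.println("last      = " + last);
    }
}
```

In the real method the ancestors are created with an implicit u+wx permission so that the user can traverse them, and only the final component receives the requested permissions.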

 

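  • Whatever the number of missing components, the previous step ends in createChildrenDirectories, which in turn calls createSingleDirectory. In that method, unprotectedMkdir writes to memory and fsd.getEditLog().logMkDir writes to disk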
private static INodesInPath createSingleDirectory(FSDirectory fsd,
      INodesInPath existing, String localName, PermissionStatus perm)
      throws IOException {
    assert fsd.hasWriteLock();
    //todo write the metadata into memory
    existing = unprotectedMkdir(fsd, fsd.allocateNewInodeId(), existing,
        localName.getBytes(Charsets.UTF_8), perm, null, now());
    if (existing == null) {
      return null;
    }

    final INode newNode = existing.getLastINode();
    // Directory creation also count towards FilesCreated
    // to match count of FilesDeleted metric.
    NameNode.getNameNodeMetrics().incrFilesCreated();

    String cur = existing.getPath();
    //todo write the metadata to disk (through the edit log)
    fsd.getEditLog().logMkDir(cur, newNode);
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("mkdirs: created directory " + cur);
    }
    return existing;
  }

 

  • Finally, step into FSDirectory.java to add the child node
/**
   * Add a child to the end of the path specified by INodesInPath.
   * @return an INodesInPath instance containing the new INode
   */
  @VisibleForTesting
  public INodesInPath addLastINode(INodesInPath existing, INode inode,
      boolean checkQuota) throws QuotaExceededException {
    assert existing.getLastINode() != null &&
        existing.getLastINode().isDirectory();

    final int pos = existing.length();
    // Disallow creation of /.reserved. This may be created when loading
    // editlog/fsimage during upgrade since /.reserved was a valid name in older
    // release. This may also be called when a user tries to create a file
    // or directory /.reserved.
    if (pos == 1 && existing.getINode(0) == rootDir && isReservedName(inode)) {
      throw new HadoopIllegalArgumentException(
          "File name \"" + inode.getLocalName() + "\" is reserved and cannot "
              + "be created. If this is during upgrade change the name of the "
              + "existing file or directory to another name before upgrading "
              + "to the new release.");
    }
    //todo get the parent directory
    final INodeDirectory parent = existing.getINode(pos - 1).asDirectory();
    // The filesystem limits are not really quotas, so this check may appear
    // odd. It's because a rename operation deletes the src, tries to add
    // to the dest, if that fails, re-adds the src from whence it came.
    // The rename code disables the quota when it's restoring to the
    // original location because a quota violation would cause the the item
    // to go "poof".  The fs limits must be bypassed for the same reason.
    if (checkQuota) {
      final String parentPath = existing.getPath();
      verifyMaxComponentLength(inode.getLocalNameBytes(), parentPath);
      verifyMaxDirItems(parent, parentPath);
    }
    // always verify inode name
    verifyINodeName(inode.getLocalNameBytes());

    final QuotaCounts counts = inode.computeQuotaUsage(getBlockStoragePolicySuite());
    updateCount(existing, pos, counts, checkQuota);

    boolean isRename = (inode.getParent() != null);
    boolean added;
    try {
      //todo add the child directory
      added = parent.addChild(inode, true, existing.getLatestSnapshotId());
    } catch (QuotaExceededException e) {
      updateCountNoQuotaCheck(existing, pos, counts.negation());
      throw e;
    }
    if (!added) {
      updateCountNoQuotaCheck(existing, pos, counts.negation());
      return null;
    } else {
      if (!isRename) {
        AclStorage.copyINodeDefaultAcl(inode);
      }
      addToInodeMap(inode);
    }
    return INodesInPath.append(existing, inode, inode.getLocalNameBytes());
  }
  • Finally, INodeDirectory.java inserts the INode into its children list. INode.java has two main subclasses, one for files and one for directories
children.add(-insertionPoint - 1, node);

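Summary: FSDirectory.java manages the in-memory directory tree. The tree is built from the abstract class INode.java, which has two main subclasses, INodeDirectory.java and INodeFile.java. INodeDirectory holds its child nodes in private List<INode> children = null; and a child can be either a file or a directory.

So the answer to Question 1 is: the in-memory directory tree is a tree of INode objects in which every directory keeps its children in a List<INode>, and each new entry is inserted into its parent's list. How ZooKeeper stores its tree will be covered when the ZooKeeper source is analyzed.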
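
The index expression in children.add(-insertionPoint - 1, node) suggests that the list is kept sorted and that the insertion point comes from a binary search. The sketch below shows that technique with plain strings. SortedChildren is a hypothetical stand-in for INodeDirectory; the real class stores INode objects and compares their name bytes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for INodeDirectory (assumption: children are plain
// names here, not INode objects).
class SortedChildren {
    private final List<String> children = new ArrayList<>();

    /** Inserts name in sorted position; returns false if it already exists. */
    boolean addChild(String name) {
        int insertionPoint = Collections.binarySearch(children, name);
        if (insertionPoint >= 0) {
            return false; // a child with this name is already present
        }
        // For a missing key, binarySearch returns (-(insertion index) - 1),
        // so the real index is recovered with -insertionPoint - 1.
        children.add(-insertionPoint - 1, name);
        return true;
    }

    List<String> list() {
        return Collections.unmodifiableList(children);
    }

    public static void main(String[] args) {
        SortedChildren dir = new SortedChildren();
        dir.addChild("user");
        dir.addChild("tmp");
        dir.addChild("app");
        System.out.println(dir.list());            // [app, tmp, user]
        System.out.println(dir.addChild("tmp"));   // false
    }
}
```

Keeping the list sorted means a lookup by name can also use binary search instead of scanning every child.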

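Saving HDFS metadata to disk

  • Keep following the mkdirs() source to trace how metadata is persisted to disk

Again, a rough flow diagram first

[Figure: flow for persisting metadata to disk]

  • As seen above, fsd.getEditLog().logMkDir is what writes to disk. This is implemented in the logEdit method of FSEditLog.java
  • The method takes a lock as soon as it is entered. Inside the synchronized (this) {} block there is a txid, initially 0, which beginTransaction() increments with txid++. editLogStream.write(op) is an abstract method; two of its implementations, QuorumOutputStream and EditLogFileOutputStream, write the edits to the JournalNodes and to the NameNode's local disk respectively. Both end up writing the operation into bufCurrent of EditsDoubleBuffer.java.
  • The editLogStream in use wraps both of these streams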
/**
   * Write an operation to the edit log. Do not sync to persistent
   * store yet.
   */
  void logEdit(final FSEditLogOp op) {
    synchronized (this) {
      assert isOpenForWrite() :
        "bad state: " + state;
      
      // wait if an automatic sync is scheduled
      //todo no need to wait the first time through
      waitIfAutoSyncScheduled();
      
      long start = beginTransaction();
      //todo assign the current transaction id to the op
      op.setTransactionId(txid);

      try {
        //todo write the op into the in-memory buffers (local and JournalNode)
        //todo editLogStream wraps two streams
        //todo QuorumJournalManager -> QuorumOutputStream: JournalNodes
        //todo FileJournalManager -> EditLogFileOutputStream: local disk
        editLogStream.write(op);
      } catch (IOException ex) {
        // All journals failed, it is handled in logSync.
      } finally {
        op.reset();
      }

      endTransaction(start);
      
      // check if it is time to schedule an automatic sync
      if (!shouldForceSync()) {
        return;
      }
      isAutoSyncScheduled = true;
    }
   // sync buffered edit log entries to persistent store
    //todo sync the buffered edit log entries to persistent storage
    logSync();
  }
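  • Inside logSync() the two in-memory buffers are swapped: the contents of one buffer are written to disk while the other keeps accepting operations from clients.
  • myTransactionId is a ThreadLocal, so each thread remembers the txid of its own most recent operation. The shared counter txid itself is only modified inside synchronized blocks, which is what keeps it thread-safe.
  • The first thread to enter this method also takes the lock right away. It does not enter the loop while (mytxid > synctxid && isSyncRunning) because no sync is running yet, so it sets syncStart = txid and isSyncRunning = true. By this point the locked logEdit method may already have written many metadata operations into bufCurrent. The thread calls editLogStream.setReadyToFlush() to swap the buffers and then, outside the lock, logStream.flush() writes them to disk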
public void logSync() {
    long syncStart = 0;

    // Fetch the transactionId of this thread. 
    long mytxid = myTransactionId.get().txid;
    
    boolean sync = false;
    try {
      EditLogOutputStream logStream = null;
      synchronized (this) {
        try {
          printStatistics(false);

          // if somebody is already syncing, then wait
          while (mytxid > synctxid && isSyncRunning) {
            try {
              wait(1000);
            } catch (InterruptedException ie) {
            }
          }
  
          //
          // If this transaction was already flushed, then nothing to do
          //
          if (mytxid <= synctxid) {
            numTransactionsBatchedInSync++;
            if (metrics != null) {
              // Metrics is non-null only when used inside name node
              metrics.incrTransactionsBatchedInSync();
            }
            return;
          }
     
          // now, this thread will do the sync
          syncStart = txid;
          isSyncRunning = true;
          sync = true;
  
          // swap buffers
          try {
            if (journalSet.isEmpty()) {
              throw new IOException("No journals available to flush");
            }
            //todo swap the buffers
            editLogStream.setReadyToFlush();
          } catch (IOException e) {
            final String msg =
                "Could not sync enough journals to persistent storage " +
                "due to " + e.getMessage() + ". " +
                "Unsynced transactions: " + (txid - synctxid);
            LOG.fatal(msg, new Exception());
            synchronized(journalSetLock) {
              IOUtils.cleanup(LOG, journalSet);
            }
            terminate(1, msg);
          }
        } finally {
          // Prevent RuntimeException from blocking other log edit write 
          doneWithAutoSyncScheduling();
        }
        //editLogStream may become null,
        //so store a local variable for flush.
        logStream = editLogStream;
      }
      
      // do the sync
      long start = monotonicNow();
      try {
        if (logStream != null) {
          //todo flush to disk
          logStream.flush();
        }
      } catch (IOException ex) {
        synchronized (this) {
          final String msg =
              "Could not sync enough journals to persistent storage. "
              + "Unsynced transactions: " + (txid - synctxid);
          LOG.fatal(msg, new Exception());
          synchronized(journalSetLock) {
            IOUtils.cleanup(LOG, journalSet);
          }
          terminate(1, msg);
        }
      }
      long elapsed = monotonicNow() - start;
  
      if (metrics != null) { // Metrics non-null only when used inside name node
        metrics.addSync(elapsed);
      }
      
    } finally {
      // Prevent RuntimeException from blocking other log edit sync 
      synchronized (this) {
        if (sync) {
          synctxid = syncStart;
          isSyncRunning = false;
        }
        this.notifyAll();
     }
    }
  }
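
The txid bookkeeping in logSync() is easier to see in isolation. The following is a single-threaded model, written under the assumption that only the counters matter: GroupCommitModel is a hypothetical class, it does no I/O, and the caller passes mytxid explicitly where FSEditLog reads it from the ThreadLocal myTransactionId.

```java
// Hypothetical model of the txid / synctxid counters in FSEditLog.
// No buffers and no I/O: it only shows why one flush can cover many edits.
class GroupCommitModel {
    private long txid = 0;      // last transaction id handed out
    private long synctxid = 0;  // last transaction id known to be flushed
    private int flushes = 0;    // number of flushes actually performed

    /** Logs one edit and returns the transaction id assigned to it. */
    synchronized long logEdit() {
        txid++;
        return txid;
    }

    /**
     * Returns true if this call performed a flush, false if an earlier
     * flush had already covered mytxid (the "batched in sync" case).
     */
    synchronized boolean logSync(long mytxid) {
        if (mytxid <= synctxid) {
            return false;
        }
        long syncStart = txid; // everything buffered so far goes out together
        flushes++;
        synctxid = syncStart;
        return true;
    }

    synchronized long syncTxid() { return synctxid; }

    synchronized int flushCount() { return flushes; }

    public static void main(String[] args) {
        GroupCommitModel log = new GroupCommitModel();
        long a = log.logEdit(); // txid 1
        long b = log.logEdit(); // txid 2
        long c = log.logEdit(); // txid 3
        System.out.println(log.logSync(b)); // true: flushes txids 1..3
        System.out.println(log.logSync(c)); // false: already covered
        System.out.println(log.logSync(a)); // false: already covered
        System.out.println(log.flushCount()); // 1
    }
}
```

Three edits cost a single flush here. In the real method the waiting threads block in wait(1000) until the syncing thread calls notifyAll(), then find mytxid <= synctxid and return without touching the disk.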
  • How does editLogStream.setReadyToFlush() swap the buffers? The source below gives the answer
public void setReadyToFlush() {
    //todo swap the buffers
    assert isFlushed() : "previous data not flushed yet";
    TxnBuffer tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }

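This also answers Question 2: with double buffering, new edits keep landing in one in-memory buffer while the other is being flushed, so the heavy disk I/O of persisting NameNode metadata does not slow down incoming operations as much.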
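
A minimal sketch of the double-buffer idea follows. DoubleBufferSketch is a hypothetical class: it buffers strings instead of serialized ops, and an in-memory list stands in for the edit log file. The demo is single-threaded; in FSEditLog the isSyncRunning flag is what ensures that only one thread flushes at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of EditsDoubleBuffer: strings instead of
// serialized ops, and a List standing in for the on-disk edit log.
class DoubleBufferSketch {
    private List<String> bufCurrent = new ArrayList<>(); // receives new writes
    private List<String> bufReady = new ArrayList<>();   // waiting to be flushed
    private final List<String> disk = new ArrayList<>(); // stand-in for the file

    synchronized void write(String op) {
        bufCurrent.add(op);
    }

    /** Swaps the two buffers; cheap, so it can run while holding the lock. */
    synchronized void setReadyToFlush() {
        if (!bufReady.isEmpty()) {
            throw new IllegalStateException("previous data not flushed yet");
        }
        List<String> tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    /** Writes out the ready buffer; this is the slow step in the real code. */
    synchronized void flush() {
        disk.addAll(bufReady);
        bufReady.clear();
    }

    synchronized int pending() { return bufCurrent.size(); }

    synchronized List<String> disk() { return new ArrayList<>(disk); }

    public static void main(String[] args) {
        DoubleBufferSketch buf = new DoubleBufferSketch();
        buf.write("mkdir /a");
        buf.write("mkdir /b");
        buf.setReadyToFlush();
        buf.write("mkdir /c");             // accepted while the flush is pending
        buf.flush();
        System.out.println(buf.disk());    // [mkdir /a, mkdir /b]
        System.out.println(buf.pending()); // 1
    }
}
```

For simplicity flush() is synchronized in this sketch. The real logSync() calls logStream.flush() outside the synchronized block, which is exactly what lets writers continue during the disk write.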

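Question 3: how do the NameNode and the JournalNodes communicate? For editLogStream.write(op) we have only looked at EditLogFileOutputStream so far, which writes the NameNode's local metadata. Now look at QuorumOutputStream to see how edits are written to the JournalNodes.

loggers.sendEdits() goes into AsyncLoggerSet.java, which forwards the call to each AsyncLogger. AsyncLogger is an interface, and its implementation is IPCLoggerChannel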

@Override
  protected void flushAndSync(boolean durable) throws IOException {
    int numReadyBytes = buf.countReadyBytes();
    if (numReadyBytes > 0) {
      int numReadyTxns = buf.countReadyTxns();
      long firstTxToFlush = buf.getFirstReadyTxId();

      assert numReadyTxns > 0;

      // Copy from our double-buffer into a new byte array. This is for
      // two reasons:
      // 1) The IPC code has no way of specifying to send only a slice of
      //    a larger array.
      // 2) because the calls to the underlying nodes are asynchronous, we
      //    need a defensive copy to avoid accidentally mutating the buffer
      //    before it is sent.
      DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
      buf.flushTo(bufToSend);
      assert bufToSend.getLength() == numReadyBytes;
      byte[] data = bufToSend.getData();
      assert data.length == bufToSend.getLength();
      //todo send the data to the JournalNodes
      QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
          segmentTxId, firstTxToFlush,
          numReadyTxns, data);
      //todo blocking call: waits for the JournalNode cluster to respond
      loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
      
      // Since we successfully wrote this batch, let the loggers know. Any future
      // RPCs will thus let the loggers know of the most recent transaction, even
      // if a logger has fallen behind.
      loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
    }
  }
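
The pattern in flushAndSync, sending to every logger asynchronously and then blocking until a quorum has answered, can be sketched with the standard library alone. QuorumWriteSketch and writeToQuorum are hypothetical names, and each logger is modeled as a Supplier<Boolean> that reports whether its write succeeded. Unlike the real QuorumCall, this sketch does not return early when a majority has already failed; it simply waits for the timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of "send to all, wait for a majority".
class QuorumWriteSketch {
    /** Returns true once a majority of loggers acked within the timeout. */
    static boolean writeToQuorum(List<Supplier<Boolean>> loggers, long timeoutMs) {
        int majority = loggers.size() / 2 + 1;
        CountDownLatch acks = new CountDownLatch(majority);
        for (Supplier<Boolean> logger : loggers) {
            // Each send runs asynchronously; only successful acks count.
            CompletableFuture.supplyAsync(logger).thenAccept(ok -> {
                if (ok) {
                    acks.countDown();
                }
            });
        }
        try {
            return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        List<Supplier<Boolean>> loggers = new ArrayList<>();
        loggers.add(() -> true);   // JournalNode 1 acks
        loggers.add(() -> true);   // JournalNode 2 acks
        loggers.add(() -> false);  // JournalNode 3 fails
        System.out.println(writeToQuorum(loggers, 1000)); // true: 2 of 3 acked
    }
}
```

With three JournalNodes a write succeeds as soon as two have acknowledged it, so one slow or failed node does not block the NameNode.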

In IPCLoggerChannel, the sendEdits() method calls getProxy().journal(). So the answer to Question 3 is that the NameNode and the JournalNodes communicate over RPC, and the JournalNode persists the edits to disk.

getProxy().journal(createReqInfo(),
                segmentTxId, firstTxnId, numTxns, data);
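
What "calling a method on a proxy" means can be illustrated with a JDK dynamic proxy. This is only a sketch of the general idea: the interface JournalService and the classes below are hypothetical, and the handler invokes a local object directly where a real RPC layer would serialize the method name and arguments and send them to another process.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a proxy that looks like a local object to the caller.
class RpcProxySketch {
    /** Made-up protocol interface shared by "client" and "server". */
    interface JournalService {
        int journal(long firstTxnId, int numTxns);
    }

    /** "Server side": the implementation that would live in another process. */
    static class JournalServer implements JournalService {
        final List<String> received = new ArrayList<>();

        @Override
        public int journal(long firstTxnId, int numTxns) {
            received.add(firstTxnId + "+" + numTxns);
            return numTxns;
        }
    }

    /** "Client side": every call on the proxy passes through the handler. */
    static JournalService proxyFor(JournalService remote, List<String> wireLog) {
        InvocationHandler handler = (proxy, method, args) -> {
            wireLog.add(method.getName()); // a real RPC layer would send this
            return method.invoke(remote, args);
        };
        return (JournalService) Proxy.newProxyInstance(
            JournalService.class.getClassLoader(),
            new Class<?>[] {JournalService.class},
            handler);
    }

    public static void main(String[] args) {
        JournalServer server = new JournalServer();
        List<String> wireLog = new ArrayList<>();
        JournalService client = proxyFor(server, wireLog);
        System.out.println(client.journal(5L, 3)); // 3
        System.out.println(server.received);       // [5+3]
        System.out.println(wireLog);               // [journal]
    }
}
```

The caller only sees the interface, which is why namenode.mkdirs(...) in DFSClient and getProxy().journal(...) in IPCLoggerChannel read like ordinary method calls.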

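Editlog execution flow

Checkpoint trigger conditions

To be continued: analysis of Question 4 and Question 5.

EditLogTailer runs a background thread. Once started, it periodically reads the edit log from the JournalNode cluster and applies those edits to its own metadata (the in-memory namespace; the on-disk fsimage is refreshed by checkpoints)

EditLogTailerThread

The Standby NameNode periodically checkpoints its in-memory state into an fsimage file, which then replaces the fsimage file on the Active NameNode

StandbyCheckpointer: checkpoint conditions: 1. more than a configured number of transactions have accumulated; 2. more than a configured amount of time has passed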
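
The two conditions can be written as one small predicate. CheckpointTriggerSketch is a hypothetical class; the thresholds in the demo are the Hadoop 2.x defaults as I understand them (dfs.namenode.checkpoint.txns = 1000000 and dfs.namenode.checkpoint.period = 3600 seconds), so check hdfs-default.xml for the release in use.

```java
// Hypothetical sketch of the checkpoint decision: either threshold alone
// is enough to trigger a checkpoint.
class CheckpointTriggerSketch {
    static boolean needCheckpoint(long uncheckpointedTxns, long secsSinceLast,
                                  long txnThreshold, long periodSecs) {
        return uncheckpointedTxns >= txnThreshold || secsSinceLast >= periodSecs;
    }

    public static void main(String[] args) {
        long txnThreshold = 1_000_000L; // assumed default of dfs.namenode.checkpoint.txns
        long periodSecs = 3600L;        // assumed default of dfs.namenode.checkpoint.period
        System.out.println(needCheckpoint(1_200_000L, 60L, txnThreshold, periodSecs)); // true
        System.out.println(needCheckpoint(500L, 4000L, txnThreshold, periodSecs));     // true
        System.out.println(needCheckpoint(500L, 60L, txnThreshold, periodSecs));       // false
    }
}
```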