Table of Contents
Saving HDFS metadata in memory
Saving HDFS metadata to disk
Edit log execution flow
Checkpoint trigger conditions
Preface:
Think with questions in mind and find the answers in the source code.
Question one: What data structure does the NameNode use for its in-memory directory tree? Is it the same as ZooKeeper's?
Question two: Is writing the NameNode's metadata to disk particularly slow? What mechanism speeds this process up?
Question three: How do the NameNode and JournalNodes communicate? How do these separate processes talk to each other?
Question four: When are fsimage and editlog merged on the NameNode? When does the standby NameNode merge fsimage and editlog, and when does the NameNode fetch the fsimage from the standby NameNode?
Question five: Why is the standby NameNode, rather than the active NameNode, used to merge fsimage and editlog?
Saving HDFS metadata in memory
- Let's see how metadata is kept in memory by following the code path of creating a directory. The source code examined here is from hadoop-2.7.x.
import java.io.IOException;
import java.net.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public static void main(String[] args) throws IOException, URISyntaxException, InterruptedException {
    Configuration configuration = new Configuration();
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.101.51:9000"), configuration, "root");
    fileSystem.mkdirs(new Path("/user/mydata1"));
}
- After the code runs successfully, the new directory is visible in HDFS.
- The rough source-code flow for creating the directory tree is shown in the figure.
Below is how, during mkdirs, the new entry is added to List<INode> children in the directory tree.
- First, enter fileSystem.mkdirs(Path f) in FileSystem.java.
/**
* Call {@link #mkdirs(Path, FsPermission)} with default permission.
*/
public boolean mkdirs(Path f) throws IOException {
return mkdirs(f, FsPermission.getDirDefault());
}
- Stepping into mkdirs, we find it is an abstract method.
/**
* Make the given file and all non-existent parents into
* directories. Has the semantics of Unix 'mkdir -p'.
* Existence of the directory hierarchy is not an error.
* @param f path to create
* @param permission to apply to f
*/
public abstract boolean mkdirs(Path f, FsPermission permission
) throws IOException;
- The implementation is in DistributedFileSystem.java, which calls mkdirsInternal(f, permission, true).
public boolean mkdirs(Path f, FsPermission permission) throws IOException {
return mkdirsInternal(f, permission, true);
}
- Finally, through DFSClient.java; namenode here is of type ClientProtocol, which means the client is making a remote call to the server side.
return namenode.mkdirs(src, absPermission, createParent);
- This lands in the mkdirs(String src, FsPermission masked, boolean createParent) method of NameNodeRpcServer.java,
return namesystem.mkdirs(src,
new PermissionStatus(getRemoteUser().getShortUserName(),
null, masked), createParent);
- Then into FSNamesystem.java:
checkOperation(OperationCategory.WRITE);
//todo directories cannot be created while in safe mode
checkNameNodeSafeMode("Cannot create directory " + src);
auditStat = FSDirMkdirOp.mkdirs(this, src, permissions, createParent);
- Next, into FSDirMkdirOp.java.
- In this method, fsn.getFSDirectory() obtains the directory tree. It then checks whether the last existing INode on the target path (lastINode) is a file and throws an exception if so; it then works out which path components do not yet exist and, depending on their number (length), ultimately calls createChildrenDirectories(fsd, existing, ancestors, addImplicitUwx(permissions, permissions)) to create the directories.
static HdfsFileStatus mkdirs(FSNamesystem fsn, String src,
PermissionStatus permissions, boolean createParent) throws IOException {
FSDirectory fsd = fsn.getFSDirectory();
if(NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("DIR* NameSystem.mkdirs: " + src);
}
if (!DFSUtil.isValidName(src)) {
throw new InvalidPathException(src);
}
FSPermissionChecker pc = fsd.getPermissionChecker();
byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
fsd.writeLock();
try {
src = fsd.resolvePath(pc, src, pathComponents);
INodesInPath iip = fsd.getINodesInPath4Write(src);
if (fsd.isPermissionEnabled()) {
fsd.checkTraverse(pc, iip);
}
/**
* src /user/warehouse/hive/gmall.db/data
* lastINode /user/warehouse/hive/gmall.db
* check whether the last, already-existing INode on the path is a file
*/
final INode lastINode = iip.getLastINode();
if (lastINode != null && lastINode.isFile()) {
throw new FileAlreadyExistsException("Path is not a directory: " + src);
}
INodesInPath existing = lastINode != null ? iip : iip.getExistingINodes();
if (lastINode == null) {
if (fsd.isPermissionEnabled()) {
fsd.checkAncestorAccess(pc, iip, FsAction.WRITE);
}
if (!createParent) {
fsd.verifyParentDir(iip, src);
}
// validate that we have enough inodes. This is, at best, a
// heuristic because the mkdirs() operation might need to
// create multiple inodes.
fsn.checkFsObjectLimit();
//todo the path components not yet present in iip are the directories that need to be created
List<String> nonExisting = iip.getPath(existing.length(),
iip.length() - existing.length());
int length = nonExisting.size();
if (length > 1) {
List<String> ancestors = nonExisting.subList(0, length - 1);
// Ensure that the user can traversal the path by adding implicit
// u+wx permission to all ancestor directories
existing = createChildrenDirectories(fsd, existing, ancestors,
addImplicitUwx(permissions, permissions));
if (existing == null) {
throw new IOException("Failed to create directory: " + src);
}
}
if ((existing = createChildrenDirectories(fsd, existing,
nonExisting.subList(length - 1, length), permissions)) == null) {
throw new IOException("Failed to create directory: " + src);
}
}
return fsd.getAuditFileInfo(existing);
} finally {
fsd.writeUnlock();
}
}
- After the length check above, both branches end up in createChildrenDirectories and finally in createSingleDirectory. In this method, unprotectedMkdir writes the metadata into memory and fsd.getEditLog().logMkDir writes it to disk.
private static INodesInPath createSingleDirectory(FSDirectory fsd,
INodesInPath existing, String localName, PermissionStatus perm)
throws IOException {
assert fsd.hasWriteLock();
//todo write the metadata into memory
existing = unprotectedMkdir(fsd, fsd.allocateNewInodeId(), existing,
localName.getBytes(Charsets.UTF_8), perm, null, now());
if (existing == null) {
return null;
}
final INode newNode = existing.getLastINode();
// Directory creation also count towards FilesCreated
// to match count of FilesDeleted metric.
NameNode.getNameNodeMetrics().incrFilesCreated();
String cur = existing.getPath();
//todo write the metadata to disk (the edit log)
fsd.getEditLog().logMkDir(cur, newNode);
if (NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("mkdirs: created directory " + cur);
}
return existing;
}
- Finally, FSDirectory.java adds the child node.
/**
* Add a child to the end of the path specified by INodesInPath.
* @return an INodesInPath instance containing the new INode
*/
@VisibleForTesting
public INodesInPath addLastINode(INodesInPath existing, INode inode,
boolean checkQuota) throws QuotaExceededException {
assert existing.getLastINode() != null &&
existing.getLastINode().isDirectory();
final int pos = existing.length();
// Disallow creation of /.reserved. This may be created when loading
// editlog/fsimage during upgrade since /.reserved was a valid name in older
// release. This may also be called when a user tries to create a file
// or directory /.reserved.
if (pos == 1 && existing.getINode(0) == rootDir && isReservedName(inode)) {
throw new HadoopIllegalArgumentException(
"File name \"" + inode.getLocalName() + "\" is reserved and cannot "
+ "be created. If this is during upgrade change the name of the "
+ "existing file or directory to another name before upgrading "
+ "to the new release.");
}
//todo get the parent directory
final INodeDirectory parent = existing.getINode(pos - 1).asDirectory();
// The filesystem limits are not really quotas, so this check may appear
// odd. It's because a rename operation deletes the src, tries to add
// to the dest, if that fails, re-adds the src from whence it came.
// The rename code disables the quota when it's restoring to the
// original location because a quota violation would cause the the item
// to go "poof". The fs limits must be bypassed for the same reason.
if (checkQuota) {
final String parentPath = existing.getPath();
verifyMaxComponentLength(inode.getLocalNameBytes(), parentPath);
verifyMaxDirItems(parent, parentPath);
}
// always verify inode name
verifyINodeName(inode.getLocalNameBytes());
final QuotaCounts counts = inode.computeQuotaUsage(getBlockStoragePolicySuite());
updateCount(existing, pos, counts, checkQuota);
boolean isRename = (inode.getParent() != null);
boolean added;
try {
//todo add the child node
added = parent.addChild(inode, true, existing.getLatestSnapshotId());
} catch (QuotaExceededException e) {
updateCountNoQuotaCheck(existing, pos, counts.negation());
throw e;
}
if (!added) {
updateCountNoQuotaCheck(existing, pos, counts.negation());
return null;
} else {
if (!isRename) {
AclStorage.copyINodeDefaultAcl(inode);
}
addToInodeMap(inode);
}
return INodesInPath.append(existing, inode, inode.getLocalNameBytes());
}
- Finally, INodeDirectory.java inserts the new INode into its children list:
children.add(-insertionPoint - 1, node);
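The index -insertionPoint - 1 follows the binary-search convention: children is kept sorted by name, and a search for a name that is not present returns -(insertionIndex) - 1, so negating and subtracting one recovers the position that keeps the list sorted. A minimal standalone sketch of the same idiom (using java.util.Collections.binarySearch here purely for illustration, not Hadoop's internal child search):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedInsertSketch {
    public static void main(String[] args) {
        List<String> children = new ArrayList<>(List.of("data", "hive", "user"));
        String newChild = "gmall.db";
        // For a missing element, binarySearch returns -(insertionIndex) - 1.
        int insertionPoint = Collections.binarySearch(children, newChild);
        if (insertionPoint < 0) {
            // Same recovery as children.add(-insertionPoint - 1, node) above.
            children.add(-insertionPoint - 1, newChild);
        }
        System.out.println(children); // [data, gmall.db, hive, user]
    }
}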
Summary: FSDirectory.java manages the in-memory directory tree. Tree nodes are instances of the abstract class INode.java, which has two implementation classes, INodeDirectory.java and INodeFile.java; INodeDirectory holds private List<INode> children = null;, so each child is either a file or a directory.
So the answer to question one is: the in-memory directory tree keeps each level's entries as INodes in a List<INode> children held by the parent directory. How ZooKeeper stores its tree will be looked at when analyzing the ZooKeeper source code.
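As a rough illustration of this structure, here is a simplified sketch (abbreviated, made-up class names; not the real Hadoop classes):

import java.util.ArrayList;
import java.util.List;

// Simplified model of the NameNode's in-memory directory tree.
abstract class INodeSketch {
    final String name;
    INodeSketch(String name) { this.name = name; }
}

class INodeFileSketch extends INodeSketch {
    INodeFileSketch(String name) { super(name); }
}

class INodeDirectorySketch extends INodeSketch {
    // Mirrors INodeDirectory's "private List<INode> children".
    private final List<INodeSketch> children = new ArrayList<>();
    INodeDirectorySketch(String name) { super(name); }
    void addChild(INodeSketch child) { children.add(child); }
}

class TreeDemo {
    public static void main(String[] args) {
        INodeDirectorySketch root = new INodeDirectorySketch("/");
        INodeDirectorySketch user = new INodeDirectorySketch("user");
        root.addChild(user);
        user.addChild(new INodeDirectorySketch("mydata1")); // models /user/mydata1
    }
}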
Saving HDFS metadata to disk
- Still following the mkdirs() source, we trace how metadata is persisted to disk.
Again, a rough flow diagram first.
- As seen above, fsd.getEditLog().logMkDir writes the metadata to disk. This is implemented in the logEdit method of FSEditLog.java.
- The method takes a lock as soon as it is entered. Inside the synchronized (this) {} block there is a txid, initially 0, which is incremented in beginTransaction(). editLogStream.write(op) is an abstract method with two implementations, QuorumOutputStream and EditLogFileOutputStream, which write the metadata to the JournalNodes and to the NameNode's local disk respectively; both ultimately use EditsDoubleBuffer.java to write the operation into bufCurrent.
- So the returned editLogStream fans out to two streams (a simplified fan-out sketch follows the method below).
/**
* Write an operation to the edit log. Do not sync to persistent
* store yet.
*/
void logEdit(final FSEditLogOp op) {
synchronized (this) {
assert isOpenForWrite() :
"bad state: " + state;
// wait if an automatic sync is scheduled
//todo no waiting is needed at the very beginning
waitIfAutoSyncScheduled();
long start = beginTransaction();
//todo attach the current transaction id to the op
op.setTransactionId(txid);
try {
//todo write the metadata into the in-memory buffer and to the JournalNodes
//todo editLogStream fans out to two streams:
//todo QuorumJournalManager -> QuorumOutputStream (JournalNodes)
//todo FileJournalManager -> EditLogFileOutputStream (local disk)
editLogStream.write(op);
} catch (IOException ex) {
// All journals failed, it is handled in logSync.
} finally {
op.reset();
}
endTransaction(start);
// check if it is time to schedule an automatic sync
if (!shouldForceSync()) {
return;
}
isAutoSyncScheduled = true;
}
// sync buffered edit log entries to persistent store
//todo sync the buffered edit log to persistent storage
logSync();
}
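As a rough illustration of that fan-out, here is a simplified sketch (made-up names; not Hadoop's JournalSet) of writing one edit op to several journal streams:

import java.io.IOException;
import java.util.List;

// Simplified fan-out of one edit op to multiple journal streams
// (local edit log file plus JournalNodes). Illustrative names only.
interface EditStreamSketch {
    void write(String op) throws IOException;   // buffer the op (e.g. into bufCurrent)
    void flush() throws IOException;            // push buffered ops to persistent storage
}

class JournalSetSketch implements EditStreamSketch {
    private final List<EditStreamSketch> journals;

    JournalSetSketch(List<EditStreamSketch> journals) { this.journals = journals; }

    @Override
    public void write(String op) throws IOException {
        // Every journal (local file stream, quorum stream) receives the same op.
        for (EditStreamSketch j : journals) {
            j.write(op);
        }
    }

    @Override
    public void flush() throws IOException {
        for (EditStreamSketch j : journals) {
            j.flush();
        }
    }
}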
- In logSync(), the two in-memory buffers are swapped: one buffer's contents are written to disk while the other keeps receiving data from clients.
- myTransactionId is a ThreadLocal, which is what keeps each thread's view of txid (one per metadata operation) thread safe.
- The first thread entering this method also takes the lock. On its first pass it does not enter the while (mytxid > synctxid && isSyncRunning) loop; it sets syncStart = txid and isSyncRunning = true. By this point the locked logEdit path above may already have written all pending metadata operations into bufCurrent. Now editLogStream.setReadyToFlush() swaps the buffers, and logStream.flush() writes them to disk.
public void logSync() {
long syncStart = 0;
// Fetch the transactionId of this thread.
long mytxid = myTransactionId.get().txid;
boolean sync = false;
try {
EditLogOutputStream logStream = null;
synchronized (this) {
try {
printStatistics(false);
// if somebody is already syncing, then wait
while (mytxid > synctxid && isSyncRunning) {
try {
wait(1000);
} catch (InterruptedException ie) {
}
}
//
// If this transaction was already flushed, then nothing to do
//
if (mytxid <= synctxid) {
numTransactionsBatchedInSync++;
if (metrics != null) {
// Metrics is non-null only when used inside name node
metrics.incrTransactionsBatchedInSync();
}
return;
}
// now, this thread will do the sync
syncStart = txid;
isSyncRunning = true;
sync = true;
// swap buffers
try {
if (journalSet.isEmpty()) {
throw new IOException("No journals available to flush");
}
//todo swap the buffers
editLogStream.setReadyToFlush();
} catch (IOException e) {
final String msg =
"Could not sync enough journals to persistent storage " +
"due to " + e.getMessage() + ". " +
"Unsynced transactions: " + (txid - synctxid);
LOG.fatal(msg, new Exception());
synchronized(journalSetLock) {
IOUtils.cleanup(LOG, journalSet);
}
terminate(1, msg);
}
} finally {
// Prevent RuntimeException from blocking other log edit write
doneWithAutoSyncScheduling();
}
//editLogStream may become null,
//so store a local variable for flush.
logStream = editLogStream;
}
// do the sync
long start = monotonicNow();
try {
if (logStream != null) {
//todo flush to disk
logStream.flush();
}
} catch (IOException ex) {
synchronized (this) {
final String msg =
"Could not sync enough journals to persistent storage. "
+ "Unsynced transactions: " + (txid - synctxid);
LOG.fatal(msg, new Exception());
synchronized(journalSetLock) {
IOUtils.cleanup(LOG, journalSet);
}
terminate(1, msg);
}
}
long elapsed = monotonicNow() - start;
if (metrics != null) { // Metrics non-null only when used inside name node
metrics.addSync(elapsed);
}
} finally {
// Prevent RuntimeException from blocking other log edit sync
synchronized (this) {
if (sync) {
synctxid = syncStart;
isSyncRunning = false;
}
this.notifyAll();
}
}
}
- How does editLogStream.setReadyToFlush() swap the buffers? The source below gives the answer.
public void setReadyToFlush() {
//todo swap the buffers
assert isFlushed() : "previous data not flushed yet";
TxnBuffer tmp = bufReady;
bufReady = bufCurrent;
bufCurrent = tmp;
}
This also answers question two: the double-buffer scheme prevents writing NameNode metadata to disk from saturating I/O and slowing the write path down. A simplified sketch of the pattern follows.
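Here is a minimal standalone sketch of the double-buffer idea (simplified, made-up class name; not Hadoop's EditsDoubleBuffer):

import java.util.ArrayList;
import java.util.List;

// Writers append to bufCurrent while a sync thread flushes bufReady;
// setReadyToFlush() swaps the two. Simplified for illustration.
class DoubleBufferSketch {
    private List<String> bufCurrent = new ArrayList<>(); // receives new edit ops
    private List<String> bufReady   = new ArrayList<>(); // being flushed to disk

    synchronized void write(String op) {
        bufCurrent.add(op);                // fast in-memory append, lock held only briefly
    }

    synchronized void setReadyToFlush() {
        List<String> tmp = bufReady;       // same swap as the Hadoop method above
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    void flush() {
        // Runs outside the write lock, so slow disk I/O does not block new writes,
        // which keep landing in bufCurrent. In the real code the isSyncRunning flag
        // ensures no second swap happens while a flush is still in progress.
        for (String op : bufReady) {
            // write op to the edit log file / JournalNodes here
        }
        bufReady.clear();
    }
}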
Question three: how do the NameNode and JournalNodes communicate? In editLogStream.write(op) above we only looked at the EditLogFileOutputStream class, i.e. writing the NameNode's local metadata; now look at the QuorumOutputStream class to see how the metadata is written to the JournalNodes.
loggers.sendEdits() leads into AsyncLoggerSet.java; the underlying AsyncLogger is an interface, and its implementation class is IPCLoggerChannel.
@Override
protected void flushAndSync(boolean durable) throws IOException {
int numReadyBytes = buf.countReadyBytes();
if (numReadyBytes > 0) {
int numReadyTxns = buf.countReadyTxns();
long firstTxToFlush = buf.getFirstReadyTxId();
assert numReadyTxns > 0;
// Copy from our double-buffer into a new byte array. This is for
// two reasons:
// 1) The IPC code has no way of specifying to send only a slice of
// a larger array.
// 2) because the calls to the underlying nodes are asynchronous, we
// need a defensive copy to avoid accidentally mutating the buffer
// before it is sent.
DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
buf.flushTo(bufToSend);
assert bufToSend.getLength() == numReadyBytes;
byte[] data = bufToSend.getData();
assert data.length == bufToSend.getLength();
//todo send the data to the JournalNodes
QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
segmentTxId, firstTxToFlush,
numReadyTxns, data);
//todo this call blocks, waiting for the JournalNode cluster (a write quorum) to acknowledge
loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
// Since we successfully wrote this batch, let the loggers know. Any future
// RPCs will thus let the loggers know of the most recent transaction, even
// if a logger has fallen behind.
loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
}
}
In IPCLoggerChannel we find the sendEdits() method, which calls getProxy().journal(). So for question three: the NameNode and JournalNodes communicate via RPC, and the JournalNodes persist the metadata to their own disks. A sketch of the quorum-wait pattern follows the snippet below.
getProxy().journal(createReqInfo(),
segmentTxId, firstTxnId, numTxns, data);
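To illustrate the quorum idea behind loggers.sendEdits(...) followed by waitForWriteQuorum(...), here is a simplified sketch (made-up names and types; not Hadoop's QuorumCall API): the same batch of edits is sent to every JournalNode asynchronously, and the NameNode only proceeds once a majority has acknowledged.

import java.util.List;
import java.util.concurrent.*;

// Send the same batch of edits to every "journal node" asynchronously and
// block until a majority has acknowledged. Illustrative sketch only.
class QuorumWriteSketch {
    interface JournalRpc {
        // stand-in for the RPC made via getProxy().journal(...)
        void journal(long firstTxId, byte[] edits) throws Exception;
    }

    static void sendEditsToQuorum(List<JournalRpc> journals, long firstTxId,
                                  byte[] edits, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(journals.size());
        CountDownLatch majority = new CountDownLatch(journals.size() / 2 + 1);
        try {
            for (JournalRpc j : journals) {
                pool.submit(() -> {
                    try {
                        j.journal(firstTxId, edits); // one RPC per JournalNode
                        majority.countDown();        // count successful acks
                    } catch (Exception ignored) {
                        // a minority of failures is tolerated
                    }
                });
            }
            // Block until a majority of JournalNodes have persisted the edits.
            if (!majority.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new Exception("timed out waiting for a write quorum");
            }
        } finally {
            pool.shutdown();
        }
    }
}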
Edit log execution flow
Checkpoint trigger conditions
To be continued... questions four and five will be analyzed later.
EditLogTailer is a background thread; once started, it periodically reads metadata edit logs from the JournalNode cluster and applies them to its own metadata (memory + disk).
EditLogTailerThread()
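A minimal sketch of that periodic pattern (illustrative class name, interval, and method bodies; not the actual EditLogTailer code):

// A tailer thread that periodically pulls new edits and replays them.
class EditLogTailerSketch extends Thread {
    private final long sleepTimeMs;
    private volatile boolean shouldRun = true;

    EditLogTailerSketch(long sleepTimeMs) { this.sleepTimeMs = sleepTimeMs; }

    @Override
    public void run() {
        while (shouldRun) {
            try {
                // 1) fetch edit segments newer than the last applied txid from the JournalNodes
                // 2) replay them into the in-memory namespace (and local storage)
                Thread.sleep(sleepTimeMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void stopTailing() { shouldRun = false; }
}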
The standby NameNode periodically checkpoints its in-memory data into an fsimage file, which then replaces the fsimage file on the active NameNode.
StandbyCheckpointer: the checkpoint conditions are 1. a certain number of transactions has accumulated, or 2. a certain amount of time has elapsed.
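A rough sketch of how those two conditions can be checked (illustrative thresholds in the spirit of the dfs.namenode.checkpoint.txns and dfs.namenode.checkpoint.period settings; not the actual StandbyCheckpointer logic):

// Decide whether the standby should trigger a checkpoint.
class CheckpointTriggerSketch {
    private final long checkpointTxns;      // e.g. 1,000,000 uncheckpointed transactions
    private final long checkpointPeriodMs;  // e.g. 3,600,000 ms (one hour)

    CheckpointTriggerSketch(long checkpointTxns, long checkpointPeriodMs) {
        this.checkpointTxns = checkpointTxns;
        this.checkpointPeriodMs = checkpointPeriodMs;
    }

    boolean shouldCheckpoint(long uncheckpointedTxns, long msSinceLastCheckpoint) {
        // Condition 1: too many transactions since the last checkpoint.
        if (uncheckpointedTxns >= checkpointTxns) {
            return true;
        }
        // Condition 2: too much time since the last checkpoint.
        return msSinceLastCheckpoint >= checkpointPeriodMs;
    }
}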