Pitfalls of appending file content with FileSystem's append method

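Original article

wx5af853e4b9fed, 2021-07-16 09:35:16

© Copyright belongs to the author. This is an original work by 51CTO blog author wx5af853e4b9fed; please contact the author for permission to republish, otherwise legal liability will be pursued.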

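A caveat up front: HDFS is not well suited to append operations.

This article uses a loop that keeps appending content to a file as its example. Once the file exceeds 1 KB, a new file is created and writing continues there; the program stops after five files have been filled. The code is as follows: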

package com.leboop.www;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Created by leb on 2018/4/16.
 */
public class HDFSAppend {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
//        conf.set("fs.defaultFS", "hdfs://192.168.128.11:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("fs = " + fs);
        String content = "Hello Word!\r\n";
        Path filePath = new Path("/append/file");
        FSDataOutputStream fsDos = null;
        int fileCount = 0;
        while (true) { // keep writing data to the file in a loop
            if (!fs.exists(filePath)) {
                fsDos = fs.create(filePath, false);
                fileCount++;
            } else {
                if (fs.getFileStatus(filePath).getLen() > 1024) { // file is larger than 1 KB
                    fs.rename(filePath, new Path("/append/file_" + fileCount));
                    fsDos = fs.create(filePath, false);
                    fileCount++;
                } else {
                    fsDos = fs.append(filePath);
                }
            }
            fsDos.writeBytes(content);
            if (fileCount > 5) {
                break;
            }
        }
    }
}

Local mode

Comment out the line

conf.set("fs.defaultFS", "hdfs://192.168.128.11:9000");

Running locally gives the following result:


fs = org.apache.hadoop.fs.LocalFileSystem@c267ef4
Exception in thread "main" java.io.IOException: Not supported
	at org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:357)
	at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1166)
	at com.leboop.www.HDFSAppend.main(HDFSAppend.java:32)

Process finished with exit code 1

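The local file system (LocalFileSystem) does not support FileSystem's append method.

Cluster mode

Package the program and submit it to the Hadoop cluster for testing (alternatively, uncomment the line conf.set("fs.defaultFS", "hdfs://192.168.128.11:9000"); in the code), as shown in the figure: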

[Figure: screenshot of the cluster run]

The key log output is:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to APPEND_FILE /append/file for DFSClient_NONMAPREDUCE_-136843200_1 on 192.168.128.11 because DFSClient_NONMAPREDUCE_-136843200_1 is already the current lease holder.

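On the cluster, FileSystem does support the append method, but the call fails here. The stream returned by the previous create() was never closed, so the same client still holds the lease on /append/file and the NameNode rejects the new append request.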
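To make the lease error easier to picture, here is a minimal toy model of a single-writer lease table. It is not Hadoop code: ToyLeaseTable, openForWrite and the other names are hypothetical, and the real NameNode logic is far more involved. The sketch only shows why reopening a path without closing the previous stream is rejected, while closing each stream first lets the loop continue.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a single-writer lease table; hypothetical, not Hadoop code. */
class ToyLeaseTable {
    private final Map<String, String> holderByPath = new HashMap<>();

    /** A lease released by close(), in the way closing an output stream would. */
    interface Lease extends AutoCloseable {
        @Override
        void close();
    }

    /** Grants the write lease, or fails if the path is still leased to someone. */
    Lease openForWrite(String path, String client) {
        String holder = holderByPath.get(path);
        if (holder != null) {
            throw new IllegalStateException(path + " is already leased to " + holder);
        }
        holderByPath.put(path, client);
        return () -> holderByPath.remove(path);
    }

    /** Mimics the article's loop: the same path is reopened without a close. */
    static boolean leakyLoopFails() {
        ToyLeaseTable table = new ToyLeaseTable();
        table.openForWrite("/append/file", "client-1"); // never closed
        try {
            table.openForWrite("/append/file", "client-1");
            return false;
        } catch (IllegalStateException e) {
            return true; // the second open is rejected
        }
    }

    /** Same loop, but every lease is closed before the next open. */
    static int closingLoopIterations(int n) {
        ToyLeaseTable table = new ToyLeaseTable();
        int done = 0;
        for (int i = 0; i < n; i++) {
            try (Lease lease = table.openForWrite("/append/file", "client-1")) {
                done++; // a write would happen here
            }
        }
        return done;
    }
}
```

In the real program the equivalent of closing the lease would be closing the FSDataOutputStream after each write, for example with try-with-resources.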

Fixing the local "Not supported" error and the cluster AlreadyBeingCreatedException

Drop the append method and use a BufferedWriter instead, as follows:

package com.leboop.www;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.*;

/**
 * Created by leb on 2018/4/16.
 */
public class HDFSAppend {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
//        conf.set("fs.defaultFS", "hdfs://192.168.128.11:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("fs = " + fs);
        String content = "Hello Word!Hello Word!Hello Word!Hello Word!Hello Word!\r\n";
        Path filePath = new Path("/append/file");

        int fileCount = 0;
        if (fs.exists(filePath)) {
            fs.delete(filePath, false);
        }
        OutputStream os = fs.create(filePath, false);
        fileCount++;
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(os));
        while (true) { // keep writing data to the file in a loop
            long size = fs.getFileStatus(filePath).getLen();
            if (size > 0) {
                System.out.println("size=" + size);
            }
            if (size > 1024) { // file is larger than 1 KB
                bw.flush();
                bw.close(); // the stream must be closed before renaming
                Path newFile = new Path("/append/file_" + fileCount);
                fs.rename(filePath, newFile);
                os = fs.create(filePath, false);
                bw = new BufferedWriter(new OutputStreamWriter(os));
                fileCount++;
            }
            bw.write(content);
            if (fileCount > 5) {
                break;
            }
        }
    }
}

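The program now runs normally in local mode and generates five files, as shown in the figure:

[Figure: screenshot of the local run]

However, the files are not split at 1 KB. This is caused by buffering: BufferedWriter keeps written characters in memory (8192 characters by default) and the underlying streams add buffers of their own, so getLen() does not see the data until it has been flushed.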
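The effect can be reproduced without Hadoop. The following sketch uses only java.io and java.nio on a temporary local file (BufferedLengthDemo and lengthsAroundFlush are names made up for this illustration): it writes 100 ASCII characters through a BufferedWriter and queries the file size before and after flush().

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BufferedLengthDemo {
    /** Returns {size before flush, size after flush} for a 100-character write. */
    static long[] lengthsAroundFlush() {
        try {
            Path file = Files.createTempFile("buffered-length", ".txt");
            try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                    Files.newOutputStream(file), StandardCharsets.US_ASCII))) {
                char[] data = new char[100];
                Arrays.fill(data, 'x');
                bw.write(data);                  // held in the in-memory buffer
                long before = Files.size(file);  // what a size query sees now
                bw.flush();                      // push the buffered data to the file
                long after = Files.size(file);
                return new long[] {before, after};
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = lengthsAroundFlush();
        System.out.println("before flush = " + sizes[0] + ", after flush = " + sizes[1]);
    }
}
```

The size query only reflects the write once flush() has been called, which is the same pattern the loop above runs into when it polls getLen().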

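Repackage the code and submit it to the Hadoop cluster for testing, as shown in the figure:

[Figure: screenshot of submitting the job to the cluster]

After the program has been running for a while, as shown in the figure:

[Figure: screenshot of the program output after running for a while]

a file the same size as a data block is generated on HDFS, as shown in the figure:

[Figure: screenshot of the block-sized file on HDFS]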

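In both local and cluster mode, the call

fs.getFileStatus(filePath).getLen()

cannot see data that is still buffered in memory. The reported length only changes after a buffer fills up and its contents are written out; on HDFS, the length reported for a file that is still open typically covers only completed blocks, which is consistent with the block-sized file above. This approach therefore cannot split files at a specified size.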
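One way to sidestep the size query altogether, which is not what the original article does, is to count the bytes on the client side and roll to a new file when the count would pass the limit. The sketch below shows this with plain java.nio on a temporary directory; CountingRoller and its methods are hypothetical names, and an HDFS version would need to obtain its streams from FileSystem.create instead. Note that it rolls before a write that would exceed the limit, so each file stays at or below 1024 bytes rather than slightly above it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Rolls to a new file when the locally tracked byte count would exceed a limit. */
class CountingRoller implements AutoCloseable {
    private final Path dir;
    private final long limit;
    private OutputStream out;
    private long written;   // bytes written to the current file, counted locally
    private int fileCount;

    CountingRoller(Path dir, long limit) {
        this.dir = dir;
        this.limit = limit;
    }

    void write(String record) throws IOException {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        if (out == null || written + bytes.length > limit) {
            roll();
        }
        out.write(bytes);
        written += bytes.length;
    }

    /** Closes the current file, if any, and starts the next one. */
    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        fileCount++;
        out = Files.newOutputStream(dir.resolve("file_" + fileCount));
        written = 0;
    }

    int fileCount() {
        return fileCount;
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }

    /** Writes the article's 13-byte record `records` times; returns the file sizes. */
    static long[] demo(int records, long limit) {
        try {
            Path dir = Files.createTempDirectory("roller-demo"); // left on disk for inspection
            CountingRoller roller = new CountingRoller(dir, limit);
            try {
                for (int i = 0; i < records; i++) {
                    roller.write("Hello Word!\r\n");
                }
            } finally {
                roller.close();
            }
            long[] sizes = new long[roller.fileCount()];
            for (int i = 0; i < sizes.length; i++) {
                sizes[i] = Files.size(dir.resolve("file_" + (i + 1)));
            }
            return sizes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a 1024-byte limit, 78 of the 13-byte records fit into one file (1014 bytes), so CountingRoller.demo(390, 1024) yields five files of equal size.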

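Fixing the inability to split at a specified file size

(1) Method 1

When creating the BufferedWriter, pass a buffer size equal to the split size instead of relying on the default. Note that this only controls BufferedWriter's own buffer; OutputStreamWriter and the underlying output stream may still hold data in theirs.

(2) Method 2

Close the FileSystem and obtain it again before each file-size query. The core code is:

fs.close();
fs=FileSystem.get(conf);

The full code is as follows: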

package com.leboop.www;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Created by leb on 2018/4/16.
 */
public class HDFSAppend {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        //conf.set("fs.defaultFS", "hdfs://192.168.128.11:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("fs = " + fs);
        String content = "Hello Word!\r\n";
        Path filePath = new Path("/append/file");
        FSDataOutputStream fsDos = null;
        int fileCount = 0;
        while (true) { // keep writing data to the file in a loop
            fs.close();
            fs=FileSystem.get(conf);
            if (!fs.exists(filePath)) {
                fsDos = fs.create(filePath, false);
                fileCount++;
            } else {
                if (fs.getFileStatus(filePath).getLen() > 1024) { // file is larger than 1 KB
                    fs.rename(filePath, new Path("/append/file_" + fileCount));
                    fsDos = fs.create(filePath, false);
                    fileCount++;
                } else {
                    fsDos = fs.append(filePath);
                }
            }
            fsDos.writeBytes(content);
            if (fileCount > 5) {
                break;
            }
        }
    }
}

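The principle is that every write is flushed to disk immediately (on HDFS, closing the FileSystem also closes the output streams it has open). The drawback is that this is extremely inefficient. To fix the efficiency problem, write everything into a single file first and then split it by copying with IOUtils, or feed the whole file to a map-only MapReduce job and split it by input split.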
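The "write one file, then split by copying" idea can be sketched with plain Java I/O. This is only an illustration on the local file system: ChunkSplitter and its methods are hypothetical names, and on HDFS the streams would instead come from FileSystem.open and FileSystem.create, with Hadoop's IOUtils doing the copying. The sketch cuts at fixed byte offsets, so a record may be split across two parts; cutting on line boundaries would need extra logic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ChunkSplitter {
    /** Copies `source` into parts of at most `chunkSize` bytes; returns the parts. */
    static List<Path> split(Path source, Path outDir, int chunkSize) throws IOException {
        List<Path> parts = new ArrayList<>();
        byte[] buf = new byte[chunkSize];
        try (InputStream in = Files.newInputStream(source)) {
            while (true) {
                // Fill the buffer completely unless the input ends first.
                int filled = 0;
                int read;
                while (filled < chunkSize
                        && (read = in.read(buf, filled, chunkSize - filled)) != -1) {
                    filled += read;
                }
                if (filled == 0) {
                    break; // end of input
                }
                Path part = outDir.resolve("file_" + (parts.size() + 1));
                try (OutputStream out = Files.newOutputStream(part)) {
                    out.write(buf, 0, filled);
                }
                parts.add(part);
            }
        }
        return parts;
    }

    /** Splits a temporary file of `totalBytes` zero bytes; returns the part sizes. */
    static long[] demoPartSizes(int totalBytes, int chunkSize) {
        try {
            Path dir = Files.createTempDirectory("split-demo"); // left on disk for inspection
            Path source = dir.resolve("source");
            Files.write(source, new byte[totalBytes]);
            List<Path> parts = split(source, dir, chunkSize);
            long[] sizes = new long[parts.size()];
            for (int i = 0; i < sizes.length; i++) {
                sizes[i] = Files.size(parts.get(i));
            }
            return sizes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a 5000-byte source and a 1024-byte chunk size this produces four full parts and a final part holding the remaining 904 bytes. Because the source is written once and read once, no per-record close or size query is needed.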