Background: the ORC storage format had been causing a relatively large number of problems, so the company decided that all tables should use the Parquet format instead. That meant the DataX HdfsWriter plugin needed Parquet support added. The first attempt failed with the error below.
com.alibaba.datax.common.exception.DataXException: Code:[HdfsWriter-04], Description:[您配置的文件在写入时出现IO异常.]. - java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,INT,INT,TIMESTAMP' but 'STRING' is found.
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:112)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:595)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:365)
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
at java.lang.Thread.run(Thread.java:748)
- java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,INT,INT,TIMESTAMP' but 'STRING' is found.
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:112)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:595)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:365)
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
at java.lang.Thread.run(Thread.java:748)
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:40) ~[datax-common-1.0.1.jar:na]
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:609) ~[hdfswriter-1.0.1.jar:na]
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:365) ~[hdfswriter-1.0.1.jar:na]
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56) ~[datax-core-1.0.1.jar:na]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_202]
Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,INT,INT,TIMESTAMP' but 'STRING' is found.
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) ~[hive-serde-1.1.1.jar:1.1.1]
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) ~[hive-serde-1.1.1.jar:1.1.1]
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) ~[hive-serde-1.1.1.jar:1.1.1]
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) ~[hive-serde-1.1.1.jar:1.1.1]
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754) ~[hive-serde-1.1.1.jar:1.1.1]
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:112) ~[hive-exec-1.1.1.jar:1.1.1]
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:595) ~[hdfswriter-1.0.1.jar:na]
... 3 common frames omitted
Exception in thread "taskGroup-0" com.alibaba.datax.common.exception.DataXException: Code:[HdfsWriter-04], Description:[您配置的文件在写入时出现IO异常.]. - java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,INT,INT,TIMESTAMP' but 'STRING' is found.
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:112)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:595)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:365)
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
at java.lang.Thread.run(Thread.java:748)
- java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,INT,INT,TIMESTAMP' but 'STRING' is found.
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:112)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:595)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:365)
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
at java.lang.Thread.run(Thread.java:748)
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:40)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:609)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsWriter$Task.startWrite(HdfsWriter.java:365)
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,INT,INT,TIMESTAMP' but 'STRING' is found.
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:112)
at com.alibaba.datax.plugin.writer.hdfswriter.HdfsHelper.parquetFileStartWrite(HdfsHelper.java:595)
... 3 more
[INFO] 2023-04-14 11:54:02.785 - [taskAppId=TASK-12871-337376-578849]:[131] - -> 2023-04-14 11:54:02.681 [job-0] INFO StandAloneJobContainerCommunicator - Total 1 records, 42 bytes | Speed 4B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 0.00%
2023-04-14 11:54:02.681 [job-0] ERROR JobContainer - 运行scheduler 模式[standalone]出错.
2023-04-14 11:54:02.683 [job-0] ERROR JobContainer - Exception when job run
2023-04-14 11:54:02.684 [job-0] INFO StandAloneJobContainerCommunicator - Total 1 records, 42 bytes | Speed 42B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 0.00%
[INFO] 2023-04-14 11:54:02.904 - [taskAppId=TASK-12871-337376-578849]:[217] - process has exited, execute path:/opt/soft/dolphinscheduler/tmp/dolphinscheduler/exec/process/135/12871/337376/578849, processId:19583 ,exitStatusCode:0
[INFO] 2023-04-14 11:54:03.788 - [taskAppId=TASK-12871-337376-578849]:[131] - -> 2023-04-14 11:54:02.791 [job-0] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[HdfsWriter-04], Description:[您配置的文件在写入时出现IO异常.]. - java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,STRING,INT,INT,TIMESTAMP' but 'STRING' is found.
The root cause turned out to be the uppercase column types in the job configuration: the DataX config files generated at our company write every type name in uppercase, and that is what triggered the parse failure.
Once the problem was located, the fix was easy: convert the configured types to lowercase at the point where they are read.
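A minimal sketch of the failure mode (my own illustration, not part of the patch; the class name is made up, and it assumes a hive-serde 1.x jar on the classpath, matching the hive-serde-1.1.1 seen in the log): the uppercase type string built from the config is rejected by Hive's type parser, while the lowercased string parses fine.
import java.util.List;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

public class TypeStringCaseDemo {
    public static void main(String[] args) {
        // What an uppercase column config produces vs. what the patched writer produces.
        String upper = "STRING,INT,TIMESTAMP";
        String lower = upper.toLowerCase();
        try {
            TypeInfoUtils.getTypeInfosFromTypeString(upper);
        } catch (IllegalArgumentException e) {
            // e.g. "Error: type expected at the position 0 of 'STRING,INT,TIMESTAMP' but 'STRING' is found"
            System.out.println("uppercase rejected: " + e.getMessage());
        }
        List<TypeInfo> parsed = TypeInfoUtils.getTypeInfosFromTypeString(lower);
        System.out.println("lowercase parsed " + parsed.size() + " column types");
    }
}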
As a bonus, here is the reworked code that writes the Parquet format. In essence only two places were modified; adjust them to your own needs.
HdfsWriter.java
package com.alibaba.datax.plugin.writer.hdfswriter;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.spi.Writer;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.unstructuredstorage.writer.Constant;
import com.google.common.collect.Sets;
import org.apache.commons.io.Charsets;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.*;
public class HdfsWriter extends Writer {
public static class Job extends Writer.Job {
private static final Logger LOG = LoggerFactory.getLogger(Job.class);
private Configuration writerSliceConfig = null;
private String defaultFS;
private String path;
private String fileType;
private String fileName;
private List<Configuration> columns;
private String writeMode;
private String fieldDelimiter;
private String compress;
private String encoding;
private HashSet<String> tmpFiles = new HashSet<String>();//临时文件全路径
private HashSet<String> endFiles = new HashSet<String>();//最终文件全路径
private HdfsHelper hdfsHelper = null;
@Override
public void init() {
this.writerSliceConfig = this.getPluginJobConf();
this.validateParameter();
//创建textfile存储
hdfsHelper = new HdfsHelper();
hdfsHelper.getFileSystem(defaultFS, this.writerSliceConfig);
}
private void validateParameter() {
this.defaultFS = this.writerSliceConfig.getNecessaryValue(Key.DEFAULT_FS, HdfsWriterErrorCode.REQUIRED_VALUE);
//fileType check
this.fileType = this.writerSliceConfig.getNecessaryValue(Key.FILE_TYPE, HdfsWriterErrorCode.REQUIRED_VALUE);
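// PARQUET is accepted here as a new fileType alongside the original ORC and TEXT.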
if (!fileType.equalsIgnoreCase("ORC") && !fileType.equalsIgnoreCase("TEXT") && !fileType.equalsIgnoreCase("PARQUET")) {
String message = "HdfsWriter插件目前只支持ORC和TEXT和PARQUET三种格式的文件,请将filetype选项的值配置为ORC或者TEXT或者PARQUET";
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message);
}
//path
this.path = this.writerSliceConfig.getNecessaryValue(Key.PATH, HdfsWriterErrorCode.REQUIRED_VALUE);
if (!path.startsWith("/")) {
String message = String.format("请检查参数path:[%s],需要配置为绝对路径", path);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message);
} else if (path.contains("*") || path.contains("?")) {
String message = String.format("请检查参数path:[%s],不能包含*,?等特殊字符", path);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE, message);
}
//fileName
this.fileName = this.writerSliceConfig.getNecessaryValue(Key.FILE_NAME, HdfsWriterErrorCode.REQUIRED_VALUE);
//columns check
this.columns = this.writerSliceConfig.getListConfiguration(Key.COLUMN);
if (null == columns || columns.size() == 0) {
throw DataXException.asDataXException(HdfsWriterErrorCode.REQUIRED_VALUE, "您需要指定 columns");
} else {
for (Configuration eachColumnConf : columns) {
eachColumnConf.getNecessaryValue(Key.NAME, HdfsWriterErrorCode.COLUMN_REQUIRED_VALUE);
eachColumnConf.getNecessaryValue(Key.TYPE, HdfsWriterErrorCode.COLUMN_REQUIRED_VALUE);
}
}
//writeMode check
this.writeMode = this.writerSliceConfig.getNecessaryValue(Key.WRITE_MODE, HdfsWriterErrorCode.REQUIRED_VALUE);
writeMode = writeMode.toLowerCase().trim();
Set<String> supportedWriteModes = Sets.newHashSet("append", "nonconflict", "truncate");
if (!supportedWriteModes.contains(writeMode)) {
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("仅支持append, nonConflict, truncate三种模式, 不支持您配置的 writeMode 模式 : [%s]",
writeMode));
}
this.writerSliceConfig.set(Key.WRITE_MODE, writeMode);
//fieldDelimiter check
this.fieldDelimiter = this.writerSliceConfig.getString(Key.FIELD_DELIMITER, null);
if (null == fieldDelimiter) {
throw DataXException.asDataXException(HdfsWriterErrorCode.REQUIRED_VALUE,
String.format("您提供配置文件有误,[%s]是必填参数.", Key.FIELD_DELIMITER));
} else if (1 != fieldDelimiter.length()) {
// warn: if have, length must be one
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("仅仅支持单字符切分, 您配置的切分为 : [%s]", fieldDelimiter));
}
//compress check
this.compress = this.writerSliceConfig.getString(Key.COMPRESS, null);
if (fileType.equalsIgnoreCase("TEXT")) {
Set<String> textSupportedCompress = Sets.newHashSet("GZIP", "BZIP2");
//用户可能配置的是compress:"",空字符串,需要将compress设置为null
if (StringUtils.isBlank(compress)) {
this.writerSliceConfig.set(Key.COMPRESS, null);
} else {
compress = compress.toUpperCase().trim();
if (!textSupportedCompress.contains(compress)) {
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("目前TEXT FILE仅支持GZIP、BZIP2 两种压缩, 不支持您配置的 compress 模式 : [%s]",
compress));
}
}
} else if (fileType.equalsIgnoreCase("ORC")) {
Set<String> orcSupportedCompress = Sets.newHashSet("NONE", "SNAPPY");
if (null == compress) {
this.writerSliceConfig.set(Key.COMPRESS, "NONE");
} else {
compress = compress.toUpperCase().trim();
if (!orcSupportedCompress.contains(compress)) {
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("目前ORC FILE仅支持SNAPPY压缩, 不支持您配置的 compress 模式 : [%s]",
compress));
}
}
} else if (fileType.equalsIgnoreCase("PARQUET")) {
Set<String> parquetSupportedCompress = Sets.newHashSet("NONE", "SNAPPY");
if (null == compress) {
this.writerSliceConfig.set(Key.COMPRESS, "NONE");
} else {
compress = compress.toUpperCase().trim();
if (!parquetSupportedCompress.contains(compress)) {
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("目前SNAPPY FILE仅支持SNAPPY压缩, 不支持您配置的 compress 模式 : [%s]",
compress));
}
}
}
}
@Override
public void prepare() {
//若路径已经存在,检查path是否是目录
if (hdfsHelper.isPathexists(path)) {
if (!hdfsHelper.isPathDir(path)) {
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("您配置的path: [%s] 不是一个合法的目录, 请您注意文件重名, 不合法目录名等情况.",
path));
}
//根据writeMode对目录下文件进行处理
Path[] existFilePaths = hdfsHelper.hdfsDirList(path, fileName);
boolean isExistFile = false;
if (existFilePaths.length > 0) {
isExistFile = true;
}
/**
if ("truncate".equals(writeMode) && isExistFile ) {
LOG.info(String.format("由于您配置了writeMode truncate, 开始清理 [%s] 下面以 [%s] 开头的内容",
path, fileName));
hdfsHelper.deleteFiles(existFilePaths);
} else
*/
if ("append".equalsIgnoreCase(writeMode)) {
LOG.info(String.format("由于您配置了writeMode append, 写入前不做清理工作, [%s] 目录下写入相应文件名前缀 [%s] 的文件",
path, fileName));
} else if ("nonconflict".equalsIgnoreCase(writeMode) && isExistFile) {
LOG.info(String.format("由于您配置了writeMode nonConflict, 开始检查 [%s] 下面的内容", path));
List<String> allFiles = new ArrayList<String>();
for (Path eachFile : existFilePaths) {
allFiles.add(eachFile.toString());
}
LOG.error(String.format("冲突文件列表为: [%s]", StringUtils.join(allFiles, ",")));
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("由于您配置了writeMode nonConflict,但您配置的path: [%s] 目录不为空, 下面存在其他文件或文件夹.", path));
}
} else {
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("您配置的path: [%s] 不存在, 请先在hive端创建对应的数据库和表.", path));
}
}
@Override
public void post() {
hdfsHelper.renameFile(tmpFiles, endFiles);
}
@Override
public void destroy() {
hdfsHelper.closeFileSystem();
}
@Override
public List<Configuration> split(int mandatoryNumber) {
LOG.info("begin do split...");
List<Configuration> writerSplitConfigs = new ArrayList<Configuration>();
String filePrefix = fileName;
Set<String> allFiles = new HashSet<String>();
//获取该路径下的所有已有文件列表
if (hdfsHelper.isPathexists(path)) {
allFiles.addAll(Arrays.asList(hdfsHelper.hdfsDirList(path)));
}
String fileSuffix;
//临时存放路径
String storePath = buildTmpFilePath(this.path);
//最终存放路径
String endStorePath = buildFilePath();
this.path = endStorePath;
for (int i = 0; i < mandatoryNumber; i++) {
// handle same file name
Configuration splitedTaskConfig = this.writerSliceConfig.clone();
String fullFileName = null;
String endFullFileName = null;
fileSuffix = UUID.randomUUID().toString().replace('-', '_');
fullFileName = String.format("%s%s%s__%s", defaultFS, storePath, filePrefix, fileSuffix);
endFullFileName = String.format("%s%s%s__%s", defaultFS, endStorePath, filePrefix, fileSuffix);
while (allFiles.contains(endFullFileName)) {
fileSuffix = UUID.randomUUID().toString().replace('-', '_');
fullFileName = String.format("%s%s%s__%s", defaultFS, storePath, filePrefix, fileSuffix);
endFullFileName = String.format("%s%s%s__%s", defaultFS, endStorePath, filePrefix, fileSuffix);
}
allFiles.add(endFullFileName);
//设置临时文件全路径和最终文件全路径
if ("GZIP".equalsIgnoreCase(this.compress)) {
this.tmpFiles.add(fullFileName + ".gz");
this.endFiles.add(endFullFileName + ".gz");
} else if ("BZIP2".equalsIgnoreCase(compress)) {
this.tmpFiles.add(fullFileName + ".bz2");
this.endFiles.add(endFullFileName + ".bz2");
} else {
this.tmpFiles.add(fullFileName);
this.endFiles.add(endFullFileName);
}
splitedTaskConfig
.set(com.alibaba.datax.plugin.unstructuredstorage.writer.Key.FILE_NAME,
fullFileName);
LOG.info(String.format("splited write file name:[%s]",
fullFileName));
writerSplitConfigs.add(splitedTaskConfig);
}
LOG.info("end do split.");
return writerSplitConfigs;
}
private String buildFilePath() {
boolean isEndWithSeparator = false;
switch (IOUtils.DIR_SEPARATOR) {
case IOUtils.DIR_SEPARATOR_UNIX:
isEndWithSeparator = this.path.endsWith(String
.valueOf(IOUtils.DIR_SEPARATOR));
break;
case IOUtils.DIR_SEPARATOR_WINDOWS:
isEndWithSeparator = this.path.endsWith(String
.valueOf(IOUtils.DIR_SEPARATOR_WINDOWS));
break;
default:
break;
}
if (!isEndWithSeparator) {
this.path = this.path + IOUtils.DIR_SEPARATOR;
}
return this.path;
}
/**
* 创建临时目录
*
* @param userPath
* @return
*/
private String buildTmpFilePath(String userPath) {
String tmpFilePath;
boolean isEndWithSeparator = false;
switch (IOUtils.DIR_SEPARATOR) {
case IOUtils.DIR_SEPARATOR_UNIX:
isEndWithSeparator = userPath.endsWith(String
.valueOf(IOUtils.DIR_SEPARATOR));
break;
case IOUtils.DIR_SEPARATOR_WINDOWS:
isEndWithSeparator = userPath.endsWith(String
.valueOf(IOUtils.DIR_SEPARATOR_WINDOWS));
break;
default:
break;
}
String tmpSuffix;
tmpSuffix = UUID.randomUUID().toString().replace('-', '_');
if (!isEndWithSeparator) {
tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR);
} else if ("/".equals(userPath)) {
tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR);
} else {
tmpFilePath = String.format("%s__%s%s", userPath.substring(0, userPath.length() - 1), tmpSuffix, IOUtils.DIR_SEPARATOR);
}
while (hdfsHelper.isPathexists(tmpFilePath)) {
tmpSuffix = UUID.randomUUID().toString().replace('-', '_');
if (!isEndWithSeparator) {
tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR);
} else if ("/".equals(userPath)) {
tmpFilePath = String.format("%s__%s%s", userPath, tmpSuffix, IOUtils.DIR_SEPARATOR);
} else {
tmpFilePath = String.format("%s__%s%s", userPath.substring(0, userPath.length() - 1), tmpSuffix, IOUtils.DIR_SEPARATOR);
}
}
return tmpFilePath;
}
}
public static class Task extends Writer.Task {
private static final Logger LOG = LoggerFactory.getLogger(Task.class);
private Configuration writerSliceConfig;
private String defaultFS;
private String fileType;
private String fileName;
private HdfsHelper hdfsHelper = null;
@Override
public void init() {
this.writerSliceConfig = this.getPluginJobConf();
this.defaultFS = this.writerSliceConfig.getString(Key.DEFAULT_FS);
this.fileType = this.writerSliceConfig.getString(Key.FILE_TYPE);
//得到的已经是绝对路径,eg:hdfs://10.101.204.12:9000/user/hive/warehouse/writer.db/text/test.textfile
this.fileName = this.writerSliceConfig.getString(Key.FILE_NAME);
hdfsHelper = new HdfsHelper();
hdfsHelper.getFileSystem(defaultFS, writerSliceConfig);
}
@Override
public void prepare() {
}
@Override
public void startWrite(RecordReceiver lineReceiver) {
LOG.info("begin do write...");
LOG.info(String.format("write to file : [%s]", this.fileName));
if (fileType.equalsIgnoreCase("TEXT")) {
//写TEXT FILE
hdfsHelper.textFileStartWrite(lineReceiver, this.writerSliceConfig, this.fileName,
this.getTaskPluginCollector());
} else if (fileType.equalsIgnoreCase("ORC")) {
//写ORC FILE
hdfsHelper.orcFileStartWrite(lineReceiver, this.writerSliceConfig, this.fileName,
this.getTaskPluginCollector());
} else if (fileType.equalsIgnoreCase("PARQUET")) {
//写PARQUET FILE
hdfsHelper.parquetFileStartWrite(lineReceiver, this.writerSliceConfig, this.fileName,
this.getTaskPluginCollector());
}
LOG.info("end do write");
}
@Override
public void post() {
}
@Override
public void destroy() {
}
}
}
HdfsHelper.java
package com.alibaba.datax.plugin.writer.hdfswriter;
import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.plugin.TaskPluginCollector;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.tuple.MutablePair;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.*;
public class HdfsHelper {
public static final Logger LOG = LoggerFactory.getLogger(HdfsWriter.Job.class);
public FileSystem fileSystem = null;
public JobConf conf = null;
public org.apache.hadoop.conf.Configuration hadoopConf = null;
public static final String HADOOP_SECURITY_AUTHENTICATION_KEY = "hadoop.security.authentication";
public static final String HDFS_DEFAULTFS_KEY = "fs.defaultFS";
// Kerberos
private Boolean haveKerberos = false;
private String kerberosKeytabFilePath;
private String kerberosPrincipal;
public void getFileSystem(String defaultFS, Configuration taskConfig) {
hadoopConf = new org.apache.hadoop.conf.Configuration();
Configuration hadoopSiteParams = taskConfig.getConfiguration(Key.HADOOP_CONFIG);
JSONObject hadoopSiteParamsAsJsonObject = JSON.parseObject(taskConfig.getString(Key.HADOOP_CONFIG));
if (null != hadoopSiteParams) {
Set<String> paramKeys = hadoopSiteParams.getKeys();
for (String each : paramKeys) {
hadoopConf.set(each, hadoopSiteParamsAsJsonObject.getString(each));
}
}
hadoopConf.set(HDFS_DEFAULTFS_KEY, defaultFS);
//是否有Kerberos认证
this.haveKerberos = taskConfig.getBool(Key.HAVE_KERBEROS, false);
if (haveKerberos) {
this.kerberosKeytabFilePath = taskConfig.getString(Key.KERBEROS_KEYTAB_FILE_PATH);
this.kerberosPrincipal = taskConfig.getString(Key.KERBEROS_PRINCIPAL);
hadoopConf.set(HADOOP_SECURITY_AUTHENTICATION_KEY, "kerberos");
}
this.kerberosAuthentication(this.kerberosPrincipal, this.kerberosKeytabFilePath);
conf = new JobConf(hadoopConf);
try {
fileSystem = FileSystem.get(conf);
} catch (IOException e) {
String message = String.format("获取FileSystem时发生网络IO异常,请检查您的网络是否正常!HDFS地址:[%s]",
"message:defaultFS =" + defaultFS);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
} catch (Exception e) {
String message = String.format("获取FileSystem失败,请检查HDFS地址是否正确: [%s]",
"message:defaultFS =" + defaultFS);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
if (null == fileSystem || null == conf) {
String message = String.format("获取FileSystem失败,请检查HDFS地址是否正确: [%s]",
"message:defaultFS =" + defaultFS);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, message);
}
}
private void kerberosAuthentication(String kerberosPrincipal, String kerberosKeytabFilePath) {
if (haveKerberos && StringUtils.isNotBlank(this.kerberosPrincipal) && StringUtils.isNotBlank(this.kerberosKeytabFilePath)) {
UserGroupInformation.setConfiguration(this.hadoopConf);
try {
UserGroupInformation.loginUserFromKeytab(kerberosPrincipal, kerberosKeytabFilePath);
} catch (Exception e) {
String message = String.format("kerberos认证失败,请确定kerberosKeytabFilePath[%s]和kerberosPrincipal[%s]填写正确",
kerberosKeytabFilePath, kerberosPrincipal);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.KERBEROS_LOGIN_ERROR, e);
}
}
}
/**
* 获取指定目录下的文件列表
*
* @param dir
* @return 拿到的是文件全路径,
* eg:hdfs://10.101.204.12:9000/user/hive/warehouse/writer.db/text/test.textfile
*/
public String[] hdfsDirList(String dir) {
Path path = new Path(dir);
String[] files = null;
try {
FileStatus[] status = fileSystem.listStatus(path);
files = new String[status.length];
for (int i = 0; i < status.length; i++) {
files[i] = status[i].getPath().toString();
}
} catch (IOException e) {
String message = String.format("获取目录[%s]文件列表时发生网络IO异常,请检查您的网络是否正常!", dir);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
return files;
}
/**
* 获取以fileName__ 开头的文件列表
*
* @param dir
* @param fileName
* @return
*/
public Path[] hdfsDirList(String dir, String fileName) {
Path path = new Path(dir);
Path[] files = null;
String filterFileName = fileName + "__*";
try {
PathFilter pathFilter = new GlobFilter(filterFileName);
FileStatus[] status = fileSystem.listStatus(path, pathFilter);
files = new Path[status.length];
for (int i = 0; i < status.length; i++) {
files[i] = status[i].getPath();
}
} catch (IOException e) {
String message = String.format("获取目录[%s]下文件名以[%s]开头的文件列表时发生网络IO异常,请检查您的网络是否正常!",
dir, fileName);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
return files;
}
public boolean isPathexists(String filePath) {
Path path = new Path(filePath);
boolean exist = false;
try {
exist = fileSystem.exists(path);
} catch (IOException e) {
String message = String.format("判断文件路径[%s]是否存在时发生网络IO异常,请检查您的网络是否正常!",
"message:filePath =" + filePath);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
return exist;
}
public boolean isPathDir(String filePath) {
Path path = new Path(filePath);
boolean isDir = false;
try {
isDir = fileSystem.isDirectory(path);
} catch (IOException e) {
String message = String.format("判断路径[%s]是否是目录时发生网络IO异常,请检查您的网络是否正常!", filePath);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
return isDir;
}
public void deleteFiles(Path[] paths) {
for (int i = 0; i < paths.length; i++) {
LOG.info(String.format("delete file [%s].", paths[i].toString()));
try {
fileSystem.delete(paths[i], true);
} catch (IOException e) {
String message = String.format("删除文件[%s]时发生IO异常,请检查您的网络是否正常!",
paths[i].toString());
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
}
}
public void deleteDir(Path path) {
LOG.info(String.format("start delete tmp dir [%s] .", path.toString()));
try {
if (isPathexists(path.toString())) {
fileSystem.delete(path, true);
}
} catch (Exception e) {
String message = String.format("删除临时目录[%s]时发生IO异常,请检查您的网络是否正常!", path.toString());
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
LOG.info(String.format("finish delete tmp dir [%s] .", path.toString()));
}
public void renameFile(HashSet<String> tmpFiles, HashSet<String> endFiles) {
Path tmpFilesParent = null;
if (tmpFiles.size() != endFiles.size()) {
String message = String.format("临时目录下文件名个数与目标文件名个数不一致!");
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.HDFS_RENAME_FILE_ERROR, message);
} else {
try {
for (Iterator it1 = tmpFiles.iterator(), it2 = endFiles.iterator(); it1.hasNext() && it2.hasNext(); ) {
String srcFile = it1.next().toString();
String dstFile = it2.next().toString();
Path srcFilePah = new Path(srcFile);
Path dstFilePah = new Path(dstFile);
if (tmpFilesParent == null) {
tmpFilesParent = srcFilePah.getParent();
}
LOG.info(String.format("start rename file [%s] to file [%s].", srcFile, dstFile));
boolean renameTag = false;
long fileLen = fileSystem.getFileStatus(srcFilePah).getLen();
if (fileLen > 0) {
renameTag = fileSystem.rename(srcFilePah, dstFilePah);
if (!renameTag) {
String message = String.format("重命名文件[%s]失败,请检查您的网络是否正常!", srcFile);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.HDFS_RENAME_FILE_ERROR, message);
}
LOG.info(String.format("finish rename file [%s] to file [%s].", srcFile, dstFile));
} else {
LOG.info(String.format("文件[%s]内容为空,请检查写入是否正常!", srcFile));
}
}
} catch (Exception e) {
String message = String.format("重命名文件时发生异常,请检查您的网络是否正常!");
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
} finally {
deleteDir(tmpFilesParent);
}
}
}
//关闭FileSystem
public void closeFileSystem() {
try {
fileSystem.close();
} catch (IOException e) {
String message = String.format("关闭FileSystem时发生IO异常,请检查您的网络是否正常!");
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.CONNECT_HDFS_IO_ERROR, e);
}
}
//textfile格式文件
public FSDataOutputStream getOutputStream(String path) {
Path storePath = new Path(path);
FSDataOutputStream fSDataOutputStream = null;
try {
fSDataOutputStream = fileSystem.create(storePath);
} catch (IOException e) {
String message = String.format("Create an FSDataOutputStream at the indicated Path[%s] failed: [%s]",
"message:path =" + path);
LOG.error(message);
throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e);
}
return fSDataOutputStream;
}
/**
* 写textfile类型文件
*
* @param lineReceiver
* @param config
* @param fileName
* @param taskPluginCollector
*/
public void textFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName,
TaskPluginCollector taskPluginCollector) {
char fieldDelimiter = config.getChar(Key.FIELD_DELIMITER);
List<Configuration> columns = config.getListConfiguration(Key.COLUMN);
String compress = config.getString(Key.COMPRESS, null);
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddHHmm");
String attempt = "attempt_" + dateFormat.format(new Date()) + "_0001_m_000000_0";
Path outputPath = new Path(fileName);
//todo 需要进一步确定TASK_ATTEMPT_ID
conf.set(JobContext.TASK_ATTEMPT_ID, attempt);
FileOutputFormat outFormat = new TextOutputFormat();
outFormat.setOutputPath(conf, outputPath);
outFormat.setWorkOutputPath(conf, outputPath);
if (null != compress) {
Class<? extends CompressionCodec> codecClass = getCompressCodec(compress);
if (null != codecClass) {
outFormat.setOutputCompressorClass(conf, codecClass);
}
}
try {
RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, outputPath.toString(), Reporter.NULL);
Record record = null;
while ((record = lineReceiver.getFromReader()) != null) {
MutablePair<Text, Boolean> transportResult = transportOneRecord(record, fieldDelimiter, columns, taskPluginCollector);
if (!transportResult.getRight()) {
writer.write(NullWritable.get(), transportResult.getLeft());
}
}
writer.close(Reporter.NULL);
} catch (Exception e) {
String message = String.format("写文件文件[%s]时发生IO异常,请检查您的网络是否正常!", fileName);
LOG.error(message);
Path path = new Path(fileName);
deleteDir(path.getParent());
throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e);
}
}
public static MutablePair<Text, Boolean> transportOneRecord(
Record record, char fieldDelimiter, List<Configuration> columnsConfiguration, TaskPluginCollector taskPluginCollector) {
MutablePair<List<Object>, Boolean> transportResultList = transportOneRecord(record, columnsConfiguration, taskPluginCollector);
//保存<转换后的数据,是否是脏数据>
MutablePair<Text, Boolean> transportResult = new MutablePair<Text, Boolean>();
transportResult.setRight(false);
if (null != transportResultList) {
Text recordResult = new Text(StringUtils.join(transportResultList.getLeft(), fieldDelimiter));
transportResult.setRight(transportResultList.getRight());
transportResult.setLeft(recordResult);
}
return transportResult;
}
public Class<? extends CompressionCodec> getCompressCodec(String compress) {
Class<? extends CompressionCodec> codecClass = null;
if (null == compress) {
codecClass = null;
} else if ("GZIP".equalsIgnoreCase(compress)) {
codecClass = org.apache.hadoop.io.compress.GzipCodec.class;
} else if ("BZIP2".equalsIgnoreCase(compress)) {
codecClass = org.apache.hadoop.io.compress.BZip2Codec.class;
} else if ("SNAPPY".equalsIgnoreCase(compress)) {
//todo 等需求明确后支持 需要用户安装SnappyCodec
codecClass = org.apache.hadoop.io.compress.SnappyCodec.class;
// org.apache.hadoop.hive.ql.io.orc.ZlibCodec.class not public
//codecClass = org.apache.hadoop.hive.ql.io.orc.ZlibCodec.class;
} else {
throw DataXException.asDataXException(HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format("目前不支持您配置的 compress 模式 : [%s]", compress));
}
return codecClass;
}
/**
* 写orcfile类型文件
*
* @param lineReceiver
* @param config
* @param fileName
* @param taskPluginCollector
*/
public void orcFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName,
TaskPluginCollector taskPluginCollector) {
List<Configuration> columns = config.getListConfiguration(Key.COLUMN);
String compress = config.getString(Key.COMPRESS, null);
List<String> columnNames = getColumnNames(columns);
List<ObjectInspector> columnTypeInspectors = getColumnTypeInspectors(columns);
StructObjectInspector inspector = (StructObjectInspector) ObjectInspectorFactory
.getStandardStructObjectInspector(columnNames, columnTypeInspectors);
OrcSerde orcSerde = new OrcSerde();
FileOutputFormat outFormat = new OrcOutputFormat();
if (!"NONE".equalsIgnoreCase(compress) && null != compress) {
Class<? extends CompressionCodec> codecClass = getCompressCodec(compress);
if (null != codecClass) {
outFormat.setOutputCompressorClass(conf, codecClass);
}
}
try {
RecordWriter writer = outFormat.getRecordWriter(fileSystem, conf, fileName, Reporter.NULL);
Record record = null;
while ((record = lineReceiver.getFromReader()) != null) {
MutablePair<List<Object>, Boolean> transportResult = transportOneRecord(record, columns, taskPluginCollector);
if (!transportResult.getRight()) {
writer.write(NullWritable.get(), orcSerde.serialize(transportResult.getLeft(), inspector));
}
}
writer.close(Reporter.NULL);
} catch (Exception e) {
String message = String.format("写文件文件[%s]时发生IO异常,请检查您的网络是否正常!", fileName);
LOG.error(message);
Path path = new Path(fileName);
deleteDir(path.getParent());
throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e);
}
}
public List<String> getColumnNames(List<Configuration> columns) {
List<String> columnNames = Lists.newArrayList();
for (Configuration eachColumnConf : columns) {
columnNames.add(eachColumnConf.getString(Key.NAME));
}
return columnNames;
}
/**
* 根据writer配置的字段类型,构建inspector
*
* @param columns
* @return
*/
public List<ObjectInspector> getColumnTypeInspectors(List<Configuration> columns) {
List<ObjectInspector> columnTypeInspectors = Lists.newArrayList();
for (Configuration eachColumnConf : columns) {
SupportHiveDataType columnType = SupportHiveDataType.valueOf(eachColumnConf.getString(Key.TYPE).toUpperCase());
ObjectInspector objectInspector = null;
switch (columnType) {
case TINYINT:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Byte.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case SMALLINT:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Short.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case INT:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Integer.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case BIGINT:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Long.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case FLOAT:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Float.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case DOUBLE:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Double.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case TIMESTAMP:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(java.sql.Timestamp.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case DATE:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(java.sql.Date.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case STRING:
case VARCHAR:
case CHAR:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(String.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
case BOOLEAN:
objectInspector = ObjectInspectorFactory.getReflectionObjectInspector(Boolean.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);
break;
default:
throw DataXException
.asDataXException(
HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format(
"您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d]. 请修改表中该字段的类型或者不同步该字段.",
eachColumnConf.getString(Key.NAME),
eachColumnConf.getString(Key.TYPE)));
}
columnTypeInspectors.add(objectInspector);
}
return columnTypeInspectors;
}
public OrcSerde getOrcSerde(Configuration config) {
String fieldDelimiter = config.getString(Key.FIELD_DELIMITER);
String compress = config.getString(Key.COMPRESS);
String encoding = config.getString(Key.ENCODING);
OrcSerde orcSerde = new OrcSerde();
Properties properties = new Properties();
properties.setProperty("orc.bloom.filter.columns", fieldDelimiter);
properties.setProperty("orc.compress", compress);
properties.setProperty("orc.encoding.strategy", encoding);
orcSerde.initialize(conf, properties);
return orcSerde;
}
public static MutablePair<List<Object>, Boolean> transportOneRecord(
Record record, List<Configuration> columnsConfiguration,
TaskPluginCollector taskPluginCollector) {
MutablePair<List<Object>, Boolean> transportResult = new MutablePair<List<Object>, Boolean>();
transportResult.setRight(false);
List<Object> recordList = Lists.newArrayList();
int recordLength = record.getColumnNumber();
if (0 != recordLength) {
Column column;
for (int i = 0; i < recordLength; i++) {
column = record.getColumn(i);
//todo as method
if (null != column.getRawData()) {
String rowData = column.getRawData().toString();
SupportHiveDataType columnType = SupportHiveDataType.valueOf(
columnsConfiguration.get(i).getString(Key.TYPE).toUpperCase());
//根据writer端类型配置做类型转换
try {
switch (columnType) {
case TINYINT:
recordList.add(Byte.valueOf(rowData));
break;
case SMALLINT:
recordList.add(Short.valueOf(rowData));
break;
case INT:
recordList.add(Integer.valueOf(rowData));
break;
case BIGINT:
recordList.add(column.asLong());
break;
case FLOAT:
recordList.add(Float.valueOf(rowData));
break;
case DOUBLE:
recordList.add(column.asDouble());
break;
case STRING:
case VARCHAR:
case CHAR:
recordList.add(column.asString());
break;
case BOOLEAN:
recordList.add(column.asBoolean());
break;
case DATE:
recordList.add(new java.sql.Date(column.asDate().getTime()));
break;
case TIMESTAMP:
recordList.add(new java.sql.Timestamp(column.asDate().getTime()));
break;
default:
throw DataXException
.asDataXException(
HdfsWriterErrorCode.ILLEGAL_VALUE,
String.format(
"您的配置文件中的列配置信息有误. 因为DataX 不支持数据库写入这种字段类型. 字段名:[%s], 字段类型:[%d]. 请修改表中该字段的类型或者不同步该字段.",
columnsConfiguration.get(i).getString(Key.NAME),
columnsConfiguration.get(i).getString(Key.TYPE)));
}
} catch (Exception e) {
// warn: 此处认为脏数据
String message = String.format(
"字段类型转换错误:你目标字段为[%s]类型,实际字段值为[%s].",
columnsConfiguration.get(i).getString(Key.TYPE), column.getRawData().toString());
taskPluginCollector.collectDirtyRecord(record, message);
transportResult.setRight(true);
break;
}
} else {
// warn: it's all ok if nullFormat is null
recordList.add(null);
}
}
}
transportResult.setLeft(recordList);
return transportResult;
}
/**
* 写parquet类型文件
*
* @param lineReceiver
* @param config
* @param fileName
* @param taskPluginCollector
*/
public void parquetFileStartWrite(RecordReceiver lineReceiver, Configuration config, String fileName,
TaskPluginCollector taskPluginCollector) {
List<Configuration> columns = config.getListConfiguration(Key.COLUMN);
String compress = config.getString(Key.COMPRESS, null);
List<String> columnNames = getColumnNames(columns);
List<ObjectInspector> columnTypeInspectors = getColumnTypeInspectors(columns);
StructObjectInspector inspector = (StructObjectInspector) ObjectInspectorFactory
.getStandardStructObjectInspector(columnNames, columnTypeInspectors);
ParquetHiveSerDe parquetHiveSerDe = new ParquetHiveSerDe();
MapredParquetOutputFormat outFormat = new MapredParquetOutputFormat();
if (!"NONE".equalsIgnoreCase(compress) && null != compress) {
Class<? extends CompressionCodec> codecClass = getCompressCodec(compress);
if (null != codecClass) {
FileOutputFormat.setOutputCompressorClass(conf, codecClass);
}
}
try {
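// Build the table properties passed to MapredParquetOutputFormat.getHiveRecordWriter:
// "columns" is the comma-joined column names, "columns.types" the comma-joined Hive
// type names. The toLowerCase() below is the actual fix for the
// "type expected at the position 0" error, since the Hive type parser only
// recognizes lowercase type tokens while our generated configs use uppercase.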
Properties colProperties = new Properties();
colProperties.setProperty("columns", String.join(",", columnNames));
List<String> colType = Lists.newArrayList();
columns.forEach(c -> colType.add(c.getString(Key.TYPE).toLowerCase()));
colProperties.setProperty("columns.types", String.join(",", colType));
RecordWriter writer = (RecordWriter) outFormat.getHiveRecordWriter(conf, new Path(fileName), ObjectWritable.class, true, colProperties, Reporter.NULL);
Record record = null;
while ((record = lineReceiver.getFromReader()) != null) {
MutablePair<List<Object>, Boolean> transportResult = transportOneRecord(record, columns, taskPluginCollector);
if (!transportResult.getRight()) {
writer.write(null, parquetHiveSerDe.serialize(transportResult.getLeft(), inspector));
}
}
writer.close(Reporter.NULL);
} catch (Exception e) {
String message = String.format("写文件文件[%s]时发生IO异常,请检查您的网络是否正常!", fileName);
LOG.error(message);
Path path = new Path(fileName);
deleteDir(path.getParent());
throw DataXException.asDataXException(HdfsWriterErrorCode.Write_FILE_IO_ERROR, e);
}
}
}