I. Background

        A BI cluster with more than 60 nodes and over 2 PB of data; all of the machines have been in service for more than three years.

II. Symptoms

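        Submitted Hive jobs fail frequently but sometimes succeed: the failure rate is high in the morning and the success rate is high in the afternoon.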

        Exception logs:

        Log 1:

2021-09-30 08:28:35.451 [AMRM Callback Handler Thread] INFO com.aaa.lever.master.RMCallbackHandler.onContainersCompleted(RMCallbackHandler.java:77)  -->  got container status for containerID=container_e155_1632330508050_62782_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch.
Container id: container_e155_1632330508050_62782_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
	at org.apache.hadoop.util.Shell.run(Shell.java:455)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

2021-09-30 08:28:35.602 [main] INFO com.aaa.lever.master.LeverMasterManipulator.finish(LeverMasterManipulator.java:185)  --> Application completed. Stopping running containers
2021-09-30 08:28:35.614 [main] INFO com.aaa.lever.master.LeverMasterManipulator.finish(LeverMasterManipulator.java:189)  --> Application completed. Signalling finish to RM
2021-09-30 08:28:35.722 [main] INFO com.aaa.lever.master.LeverMaster.main(LeverMaster.java:58)  --> Application Master failed:Exception from container-launch.
Container id: container_e155_1632330508050_62782_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
	at org.apache.hadoop.util.Shell.run(Shell.java:455)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

2021-09-30 08:28:35.723 [main] INFO com.aaa.lever.master.LeverMaster.main(LeverMaster.java:59)  --> exiting now

        Log 2:

Exception in thread "main" java.lang.RuntimeException: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
	at com.aaa.lever.task.SyncTask.call(SyncTask.java:58)
	at com.aaa.lever.action.SqlActionMain.executeSql(SqlActionMain.java:119)
	at com.aaa.lever.action.SqlActionMain.main(SqlActionMain.java:86)
Caused by: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
	at org.apache.hive.jdbc.HiveStatement.executeUpdate(HiveStatement.java:406)
	at org.apache.hive.jdbc.HivePreparedStatement.executeUpdate(HivePreparedStatement.java:119)
	at com.aaa.lever.task.SqlExecutorTask.doTask(SqlExecutorTask.java:110)
	at com.aaa.lever.task.SyncTask.call(SyncTask.java:45)
	... 2 more

III. Investigation

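        1. We first suspected the same cause as an earlier Hadoop cluster incident, namely a single faulty node. After the node was repaired, however, the Hive problem persisted.

        2. Working from log 1, we looked into the various causes of exit code = 1.

        This log gives no specific detail, so the real cause has to be found in a more specific log.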
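One way to get at a more specific log is to pull the failed container's own output from YARN. The sketch below is a minimal illustration, not the procedure the original investigation used: it derives the application ID from the container ID shown in log 1 and builds a `yarn logs` command line. The helper names `application_id` and `yarn_logs_command` are hypothetical, the command only returns output when log aggregation is enabled, and some older Hadoop releases also expect `-nodeAddress` alongside `-containerId`.

```python
import re
from typing import List

# Container IDs have the shape
# container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>.
CONTAINER_ID = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)")


def application_id(container_id: str) -> str:
    """Derive the ID of the YARN application that owns a container."""
    match = CONTAINER_ID.fullmatch(container_id)
    if match is None:
        raise ValueError(f"not a container ID: {container_id!r}")
    cluster_timestamp, app_sequence = match.group(1), match.group(2)
    return f"application_{cluster_timestamp}_{app_sequence}"


def yarn_logs_command(container_id: str) -> List[str]:
    """Build (but do not run) the command that fetches one container's logs."""
    return [
        "yarn", "logs",
        "-applicationId", application_id(container_id),
        "-containerId", container_id,
    ]


if __name__ == "__main__":
    # Container ID taken from log 1 above.
    print(" ".join(yarn_logs_command(
        "container_e155_1632330508050_62782_01_000002")))
```

The command is returned as an argument list so it can be passed to `subprocess.run` without shell quoting issues.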

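        3. Working from log 2, most search results are about

"Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask"

which is attributed to insufficient YARN cluster resources or a permissions problem,

whereas the exception in our case is

"Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask"

One is MapRedTask, the other is MapredLocalTask.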
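Because the two messages differ only in the class name at the end, they are easy to confuse when scanning search results. The following is a small sketch, with a hypothetical helper name `failing_task`, that extracts the return code and the failing task class from such a message so the two cases can be told apart mechanically.

```python
import re

# Matches "return code <n> from <fully.qualified.ClassName>".
TASK_ERROR = re.compile(r"return code (\d+) from (?:\w+\.)+(\w+)")


def failing_task(message):
    """Return (return_code, simple_class_name), or None if nothing matches."""
    match = TASK_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    # Message text taken from log 2 above.
    line = ("java.sql.SQLException: Error while processing statement: FAILED: "
            "Execution Error, return code 1 from "
            "org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask")
    code, task = failing_task(line)
    if task == "MapredLocalTask":
        print("local task failed with return code", code)
    elif task == "MapRedTask":
        print("cluster job failed with return code", code)
```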

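MapredLocalTask relates to local tasks. To improve efficiency, Hive automatically converts a common join into a map join, so part of the join work (loading the small table into memory) runs locally on a single machine; if that machine does not have enough memory, the task fails.

Every day from the early hours until late morning the cluster runs many jobs and memory usage is high, with some machines fully loaded. A local task that lands on one of these heavily loaded machines does not get enough memory, so failures are likely.

In the afternoon the cluster has few jobs and is relatively idle, so load is low and most jobs succeed.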
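The morning versus afternoon pattern can be checked against the logs rather than taken on impression. The sketch below assumes failure timestamps have already been collected in the format used by log 1 (`2021-09-30 08:28:35.451`) and simply counts them per hour of day. The function name `failures_by_hour` is hypothetical, and all sample timestamps except the first are made up for illustration.

```python
from collections import Counter
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def failures_by_hour(timestamps):
    """Count failures per hour of day (0-23) from log timestamp strings."""
    counts = Counter()
    for stamp in timestamps:
        counts[datetime.strptime(stamp, TIMESTAMP_FORMAT).hour] += 1
    return counts


if __name__ == "__main__":
    samples = [
        "2021-09-30 08:28:35.451",  # from log 1
        "2021-09-30 08:41:02.120",  # hypothetical
        "2021-09-30 15:02:10.000",  # hypothetical
    ]
    for hour, count in sorted(failures_by_hour(samples).items()):
        print(f"{hour:02d}:00  {count}")
```

If the counts cluster in the hours when the cluster is busiest, that supports the memory-pressure explanation; an even spread would point elsewhere.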

IV. Solution

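        Set both hive.auto.convert.join and hive.auto.convert.join.noconditionaltask to false.

        With these settings Hive no longer converts joins to map joins automatically and no longer runs the join locally on a single node, at the cost of some performance.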
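The article does not say where the two parameters were changed. One option, assumed here, is to set them per session by issuing `SET` commands before the real statements, which limits the performance cost to the jobs that need it. The sketch below only builds the statement list; the helper name `with_map_join_disabled` and the example query are hypothetical, and each string would be executed as a separate statement by whatever client submits the SQL.

```python
# Parameter names and values are the ones given in the solution above.
MAP_JOIN_SETTINGS = {
    "hive.auto.convert.join": "false",
    "hive.auto.convert.join.noconditionaltask": "false",
}


def with_map_join_disabled(statements):
    """Return the statements preceded by SET commands that turn off
    automatic map-join conversion for the current session."""
    settings = [f"SET {key}={value}" for key, value in MAP_JOIN_SETTINGS.items()]
    return settings + list(statements)


if __name__ == "__main__":
    # Hypothetical query, for illustration only.
    job = ["INSERT OVERWRITE TABLE report "
           "SELECT a.id, b.name FROM a JOIN b ON a.id = b.id"]
    for statement in with_map_join_disabled(job):
        print(statement)
```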


V. Conclusions

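        1. Logs matter: make sure you are looking at the right exception log.

        2. Quantity changes quality: an operation that is an optimization at small scale may well cause problems at large scale, so performance tuning is not fixed once and for all and has to be analyzed case by case.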