Our production NodeManagers (NM) have recently been crashing. The error log shows:
2014-06-19 00:01:22,308 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Error: Shutting down
java.util.ConcurrentModificationException
    at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761)
    at java.util.LinkedList$ListItr.next(LinkedList.java:696)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.toString(LocalizedResource.java:120)
    at java.lang.String.valueOf(String.java:2826)
    at java.lang.StringBuilder.append(StringBuilder.java:115)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:656)
2014-06-19 00:01:22,308 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://bipcluster/tmp/hive-hdfs/hive_2014-06-19_00-05-51_049_5891972191087895437/-mr-10004/a1495555-b0dc-4356-8b68-1c881012e123, 1403107405580, FILE, null }
2014-06-19 00:03:40,685 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
java.util.concurrent.RejectedExecutionException
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
    at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:618)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:514)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:456)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:128)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
    at java.lang.Thread.run(Thread.java:662)
2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.
The NM exited abnormally because multiple threads updated shared state concurrently during resource localization.
This is a known bug, tracked as:
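The second FATAL entry in the log is consistent with a follow-on failure: after the public localizer logs "Public cache exiting", a later addResource call submits work to an executor that no longer accepts tasks, and the dispatcher thread dies on the resulting RejectedExecutionException. The sketch below reproduces only that generic JDK behavior. It assumes the thread pool was shut down when the localizer thread exited, which is inferred from the log and not confirmed here; the class and method names are hypothetical and are not taken from the YARN source.

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class; it is not part of Hadoop.
public class RejectedAfterShutdownDemo {

    // Returns true if a submit to an already shut down pool is rejected.
    public static boolean submitAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ExecutorCompletionService<String> queue =
                new ExecutorCompletionService<>(pool);

        // Assumption: stands in for the localizer thread exiting
        // after the fatal ConcurrentModificationException.
        pool.shutdownNow();

        try {
            // Stands in for a later addResource() call from the dispatcher.
            queue.submit(() -> "downloaded");
            return false;
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy rejects tasks once the pool is shut down.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected after shutdown: " + submitAfterShutdown());
    }
}
```

If this reading is right, the ConcurrentModificationException is the root cause and the RejectedExecutionException is only the symptom that finally takes the NM down.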
https://issues.apache.org/jira/browse/YARN-573
Bug description:
Shared data structures in Public Localizer and Private Localizer are not Thread safe. PublicLocalizer 1) pending accessed by addResource (part of event handling) and run method (as a part of PublicLocalizer.run() ). PrivateLocalizer (LocalizerRunner?) 1) pending accessed by addResource (part of event handling) and findNextResource (i.remove()). Also update method should be fixed. It too is sharing pending list.
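Resource localization is driven by two threads, PublicLocalizer and LocalizerRunner: one handles downloads of public files and the other handles downloads of private files. Both operate on the pending list, so the fix is to add synchronization around it. This bug has already been fixed in the YARN shipped with CDH 5.2.0.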
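As a rough illustration of what "add synchronization" means here, the sketch below guards a shared pending list with a single lock, so that adds from the event-handling thread and reads from the localizer thread cannot interleave. It is a minimal sketch and not the actual YARN-573 patch: PendingList and runDemo are hypothetical names, and resources are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a localizer's shared "pending" list; not the YARN class.
public class PendingList {
    private final List<String> pending = new LinkedList<>();

    // Called from the event-handling thread.
    public void addResource(String resource) {
        synchronized (pending) {
            pending.add(resource);
        }
    }

    // Called from the localizer thread; removes and returns the next item, or null.
    public String findNextResource() {
        synchronized (pending) {
            return pending.isEmpty() ? null : pending.remove(0);
        }
    }

    public int size() {
        synchronized (pending) {
            return pending.size();
        }
    }

    // Copies under the same lock, so the iteration never sees a concurrent add.
    @Override
    public String toString() {
        synchronized (pending) {
            return new ArrayList<>(pending).toString();
        }
    }

    // Runs a writer and a reader concurrently; returns the final size,
    // or throws if either thread failed.
    public static int runDemo() {
        PendingList list = new PendingList();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread writer = new Thread(() -> {
            for (int i = 0; i < 10000; i++) {
                list.addResource("rsrc-" + i);
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 200; i++) {
                list.toString();
            }
        });
        writer.setUncaughtExceptionHandler((t, e) -> failure.set(e));
        reader.setUncaughtExceptionHandler((t, e) -> failure.set(e));

        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        if (failure.get() != null) {
            throw new IllegalStateException("worker thread failed", failure.get());
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println("final size: " + runDemo());
    }
}
```

The point of the design is that every access path, including toString, takes the same lock; leaving even one reader unsynchronized brings back the failure seen in the log.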
For more on what triggers java.util.ConcurrentModificationException, see:
http://examples.javacodegeeks.com/java-basics/exceptions/java-util-concurrentmodificationexception-how-to-handle-concurrent-modification-exception/
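For a quick local reproduction, the exception does not even need a second thread: any structural change to a LinkedList while an iterator is active trips the same checkForComodification check that appears at the top of the stack trace above. The sketch below is a minimal single-threaded illustration with a hypothetical class name; in the NM the modification came from another thread, which makes the failure intermittent instead of deterministic.

```java
import java.util.ConcurrentModificationException;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class; it is not part of Hadoop.
public class CmeDemo {

    // Returns true if modifying the list during iteration raises the exception.
    public static boolean modifyWhileIterating() {
        List<String> refs = new LinkedList<>();
        refs.add("a");
        refs.add("b");
        refs.add("c");

        try {
            for (String r : refs) {
                if (r.equals("a")) {
                    // Structural change made outside the iterator.
                    refs.add("d");
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            // Thrown by the iterator's next() on the following loop step.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("exception thrown: " + modifyWhileIterating());
    }
}
```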