Today, while running a job, I ran into the following error:
- 11/08/12 15:19:33 INFO mapred.JobClient: map 99% reduce 32%
- 11/08/12 15:20:59 INFO mapred.JobClient: map 99% reduce 33%
- 11/08/12 15:21:10 INFO mapred.JobClient: map 100% reduce 33%
- 11/08/12 15:21:34 INFO mapred.JobClient: Task Id : attempt_201108021504_30459_m_000368_0, Status : FAILED
- Too many fetch-failures
- 11/08/12 15:22:35 WARN mapred.JobClient: Error reading task outputRead timed out
- 11/08/12 15:22:36 INFO mapred.JobClient: map 100% reduce 34%
- 11/08/12 15:24:44 INFO mapred.JobClient: Task Id : attempt_201108021504_30459_m_000392_0, Status : FAILED
- Too many fetch-failures
- 11/08/12 15:25:44 WARN mapred.JobClient: Error reading task outputRead timed out
- 11/08/12 15:25:45 INFO mapred.JobClient: map 100% reduce 67%
- 11/08/12 15:25:56 INFO mapred.JobClient: Job complete: job_201108021504_30459
- 11/08/12 15:25:56 INFO mapred.JobClient: Counters: 26
- 11/08/12 15:25:56 INFO mapred.JobClient: Job Counters
- 11/08/12 15:25:56 INFO mapred.JobClient: Launched reduce tasks=303
- 11/08/12 15:25:56 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8591016
After a reduce task starts, its first phase is the shuffle, in which it fetches data from the map side. Each fetch can fail because of a connect timeout, a read timeout, a checksum error, and so on. The reduce task keeps a counter per map, recording how many times fetching that map's output has failed. Once the failure count reaches a threshold, the reduce task reports to the JobTracker that fetches of that map's output have failed too many times, and prints the following log:
Failed to fetch map-output from attempt_201105261254_102769_m_001802_0 even after MAX_FETCH_RETRIES_PER_MAP retries... reporting to the JobTracker
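For intuition, the reduce-side bookkeeping can be pictured roughly like this. This is a minimal sketch with invented names, not Hadoop's actual ReduceTask code:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the per-map failure counter described above.
// Class and method names are illustrative, not Hadoop's.
class FetchFailureTracker {
    // One failure counter per map whose output this reduce fetches.
    private final Map<String, Integer> failuresPerMap = new HashMap<>();
    private final int maxFetchRetriesPerMap; // the threshold computed below

    FetchFailureTracker(int maxFetchRetriesPerMap) {
        this.maxFetchRetriesPerMap = maxFetchRetriesPerMap;
    }

    // Called on every failed fetch (connect timeout, read timeout,
    // checksum error, ...). Returns true once the failure count for
    // this map reaches the threshold and should be reported.
    boolean recordFailure(String mapAttemptId) {
        int failures = failuresPerMap.merge(mapAttemptId, 1, Integer::sum);
        return failures >= maxFetchRetriesPerMap;
    }
}
```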
The threshold is computed as:
```java
max(MIN_FETCH_RETRIES_PER_MAP,
    getClosestPowerOf2((this.maxBackoff * 1000 / BACKOFF_INIT) + 1));
```
By default MIN_FETCH_RETRIES_PER_MAP=2, maxBackoff=300, and BACKOFF_INIT=4000, so the default threshold is 6; it can be adjusted through the mapred.reduce.copy.backoff parameter, which sets maxBackoff. Once the threshold is reached, the reduce task tells its TaskTracker over the umbilical protocol, and the TaskTracker notifies the JobTracker on its next heartbeat. When the JobTracker sees that more than 50% of the reduces have reported repeated fetch failures for a given map's output, it fails that map and reschedules it on another node, printing the following log:
"Too many fetch-failures for output of task: attempt_201105261254_102769_m_001802_0 ... killing it"