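Suppose you need to compute several kinds of statistics from one Hive table and store each result in its own summary table. The natural choice is Hive's Multi Insert statement, because it avoids scanning the same source table several times. This post records a GC overhead limit exceeded problem I hit while using a Multi Insert statement.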

Problem description

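My requirement was to aggregate data along several dimensions from a domain-related table and write each result to its corresponding table. Here is my sample SQL: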

FROM qbox_bi_gold.domain_info INPUT
INSERT OVERWRITE TABLE 5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000
INSERT OVERWRITE TABLE prov_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, prov, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, prov
INSERT OVERWRITE TABLE prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, prov, uid
INSERT OVERWRITE TABLE bucket_prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, bucket, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, bucket, prov, uid
INSERT OVERWRITE TABLE bucket_domain_prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, bucket, domain, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, bucket, domain, prov, uid
INSERT OVERWRITE TABLE bucket_city_domain_prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, bucket, city, domain, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, bucket, city, domain, prov, uid
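
Every branch groups on cast(time/300000 as bigint)*300000. Below is a minimal Python sketch of that expression, assuming time is a non-negative epoch timestamp in milliseconds (so 300000 is five minutes). The function name five_minute_bucket is made up for illustration and is not part of the job:

```python
WINDOW_MS = 300000  # assumed unit: milliseconds, i.e. a 5-minute window


def five_minute_bucket(time_ms: int) -> int:
    """Floor a non-negative millisecond timestamp to the start of its window.

    Mirrors cast(time/300000 as bigint)*300000 for non-negative inputs,
    where the cast truncates the quotient to a whole number.
    """
    if time_ms < 0:
        raise ValueError("this sketch only covers non-negative timestamps")
    return (time_ms // WINDOW_MS) * WINDOW_MS


if __name__ == "__main__":
    for t in (0, 299999, 300000, 1448841725000):
        print(t, "->", five_minute_bucket(t))
```

All rows whose timestamps fall in the same five-minute window get the same key, so the number of distinct time keys per day is small; it is the extra columns (prov, uid, bucket, domain, city) that make the later branches produce many more groups.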

The statement above produces 6 MapReduce jobs. You can prefix the query with EXPLAIN to see the execution plan:

STAGE DEPENDENCIES:
  Stage-6 is a root stage
  Stage-0 depends on stages: Stage-6
  Stage-7 depends on stages: Stage-0
  Stage-8 depends on stages: Stage-6
  Stage-1 depends on stages: Stage-8
  Stage-9 depends on stages: Stage-1
  Stage-10 depends on stages: Stage-6
  Stage-2 depends on stages: Stage-10
  Stage-11 depends on stages: Stage-2
  Stage-12 depends on stages: Stage-6
  Stage-3 depends on stages: Stage-12
  Stage-13 depends on stages: Stage-3
  Stage-14 depends on stages: Stage-6
  Stage-4 depends on stages: Stage-14
  Stage-15 depends on stages: Stage-4
  Stage-16 depends on stages: Stage-6
  Stage-5 depends on stages: Stage-16
  Stage-17 depends on stages: Stage-5

STAGE PLANS:
  Stage: Stage-6
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: input
            Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: time (type: bigint), flow (type: bigint), hits (type: bigint)
              outputColumnNames: time, flow, hits
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(flow), sum(hits)
                keys: (UDFToLong((time / 300000)) * 300000) (type: bigint)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: bigint)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: bigint)
                  Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: bigint), _col2 (type: bigint)
            Select Operator
              expressions: time (type: bigint), prov (type: string), flow (type: bigint), hits (type: bigint)
              outputColumnNames: time, prov, flow, hits
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(flow), sum(hits)
                keys: (UDFToLong((time / 300000)) * 300000) (type: bigint), prov (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
            Select Operator
              expressions: time (type: bigint), prov (type: string), uid (type: int), flow (type: bigint), hits (type: bigint)
              outputColumnNames: time, prov, uid, flow, hits
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(flow), sum(hits)
                keys: (UDFToLong((time / 300000)) * 300000) (type: bigint), prov (type: string), uid (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4
                Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
            Select Operator
              expressions: time (type: bigint), bucket (type: string), prov (type: string), uid (type: int), flow (type: bigint), hits (type: bigint)
              outputColumnNames: time, bucket, prov, uid, flow, hits
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(flow), sum(hits)
                keys: (UDFToLong((time / 300000)) * 300000) (type: bigint), bucket (type: string), prov (type: string), uid (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
            Select Operator
              expressions: time (type: bigint), bucket (type: string), domain (type: string), prov (type: string), uid (type: int), flow (type: bigint), hits (type: bigint)
              outputColumnNames: time, bucket, domain, prov, uid, flow, hits
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(flow), sum(hits)
                keys: (UDFToLong((time / 300000)) * 300000) (type: bigint), bucket (type: string), domain (type: string), prov (type: string), uid (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
            Select Operator
              expressions: time (type: bigint), bucket (type: string), city (type: string), domain (type: string), prov (type: string), uid (type: int), flow (type: bigint), hits (type: bigint)
              outputColumnNames: time, bucket, city, domain, prov, uid, flow, hits
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: sum(flow), sum(hits)
                keys: (UDFToLong((time / 300000)) * 300000) (type: bigint), bucket (type: string), city (type: string), domain (type: string), prov (type: string), uid (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: bigint)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: bigint)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: domain_areav1.5min

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            day 20151130
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: domain_areav1.5min

  Stage: Stage-7
    Stats-Aggr Operator

  Stage: Stage-8
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: bigint), _col1 (type: string)
              sort order: ++
              Map-reduce partition columns: _col0 (type: bigint), _col1 (type: string)
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col2 (type: bigint), _col3 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col3 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: domain_areav1.prov_5min

  Stage: Stage-1
    Move Operator
      tables:
          partition:
            day 20151130
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: domain_areav1.prov_5min

  Stage: Stage-9
    Stats-Aggr Operator

  Stage: Stage-10
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: int)
              sort order: +++
              Map-reduce partition columns: _col0 (type: bigint), _col1 (type: string), _col2 (type: int)
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col3 (type: bigint), _col4 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: string), KEY._col2 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4
          Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: int), _col3 (type: bigint), _col4 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4
            Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: domain_areav1.prov_uid_5min

  Stage: Stage-2
    Move Operator
      tables:
          partition:
            day 20151130
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: domain_areav1.prov_uid_5min

  Stage: Stage-11
    Stats-Aggr Operator

  Stage: Stage-12
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: int)
              sort order: ++++
              Map-reduce partition columns: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: int)
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col4 (type: bigint), _col5 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: int), _col4 (type: bigint), _col5 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: domain_areav1.bucket_prov_uid_5min

  Stage: Stage-3
    Move Operator
      tables:
          partition:
            day 20151130
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: domain_areav1.bucket_prov_uid_5min

  Stage: Stage-13
    Stats-Aggr Operator

  Stage: Stage-14
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int)
              sort order: +++++
              Map-reduce partition columns: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int)
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col5 (type: bigint), _col6 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
          Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
            Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: domain_areav1.bucket_domain_prov_uid_5min

  Stage: Stage-4
    Move Operator
      tables:
          partition:
            day 20151130
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: domain_areav1.bucket_domain_prov_uid_5min

  Stage: Stage-15
    Stats-Aggr Operator

  Stage: Stage-16
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int)
              sort order: ++++++
              Map-reduce partition columns: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int)
              Statistics: Num rows: 248368802 Data size: 106301849426 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col6 (type: bigint), _col7 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), sum(VALUE._col1)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
          Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: bigint), _col7 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 124184401 Data size: 53150924713 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: domain_areav1.bucket_city_domain_prov_uid_5min

  Stage: Stage-5
    Move Operator
      tables:
          partition:
            day 20151130
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: domain_areav1.bucket_city_domain_prov_uid_5min

  Stage: Stage-17
    Stats-Aggr Operator

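The plan shows that Stage-6 is a root stage, so Stage-6 is the first job that has to finish, and this is exactly where the problem appears: GC overhead limit exceeded!
The job history of the failed job shows that the failure happens in the map phase.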

...
map = 99%,  reduce = 33%, Cumulative CPU 9676.12 sec
map = 100%,  reduce = 100%, Cumulative CPU 9686.12 sec

So the failure occurs in the map phase. Let's look at the error stack trace first:

-12-01 18:21:02,424 INFO [communication thread] org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:250)
    at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:230)
    at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:69)
    at java.io.InputStreamReader.<init>(InputStreamReader.java:74)
    at java.io.FileReader.<init>(FileReader.java:72)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:381)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:162)
    at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:839)
    at org.apache.hadoop.mapred.Task.updateCounters(Task.java:978)
    at org.apache.hadoop.mapred.Task.access$500(Task.java:77)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:727)
    at java.lang.Thread.run(Thread.java:745)

The map phase fails with OutOfMemoryError: GC overhead limit exceeded.

Problem analysis

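Everyone knows the generic remedy for an OOM: add more memory! Well, we are not made of money, and adding memory does not always fix an OOM anyway, so let's look for the underlying cause. How should we go about it? First we need to be clear about the possible causes: 1. The program genuinely needs more memory than it has. 2. The program leaks memory or is not efficient enough. Anyone aspiring to be a senior programmer should start with the second. OK, let's analyze: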

Hive runtime environment:

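The cluster has 46 machines running Ubuntu 12.04, each with 8 cores and 32 GB of memory. Hadoop is version 2.2.0 and Hive is 0.12. The data is 100 GB+ of Text. The queue in use can take at most roughly 40% of the cluster. The Hive job above starts about 380 map tasks and about 120 reduce tasks, which should not count as large. Yet it did run out of memory. Because the job runs Hive's own code rather than code we wrote, a memory leak is unlikely. So the Hive SQL is probably unreasonable, and the first suspect is the efficiency of the Multi Insert. Test: run each INSERT on its own, that is, remove the other INSERT clauses.
The sample code is as follows: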

FROM qbox_bi_gold.domain_info INPUT
INSERT OVERWRITE TABLE 5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000;

FROM qbox_bi_gold.domain_info INPUT
INSERT OVERWRITE TABLE prov_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, prov, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, prov;

FROM qbox_bi_gold.domain_info INPUT
INSERT OVERWRITE TABLE prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, prov, uid;

FROM qbox_bi_gold.domain_info INPUT
INSERT OVERWRITE TABLE bucket_prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, bucket, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, bucket, prov, uid;

FROM qbox_bi_gold.domain_info INPUT
INSERT OVERWRITE TABLE bucket_domain_prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, bucket, domain, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, bucket, domain, prov, uid;

FROM qbox_bi_gold.domain_info INPUT
INSERT OVERWRITE TABLE bucket_city_domain_prov_uid_5min PARTITION (day="20151130")
    SELECT cast(time/300000 as bigint)*300000 AS time, bucket, city, domain, prov, uid, SUM(flow) AS flow, SUM(hits) AS hits
    WHERE day="20151130"
    GROUP BY cast(time/300000 as bigint)*300000, bucket, city, domain, prov, uid;

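All of them ran to completion. In other words, the Multi Insert is simply more memory-hungry, which leads to the OOM; the SQL itself is not at fault. The most likely cause, then, is that the memory we give the MapReduce tasks is too small. So let's first check how much memory is actually configured.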
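
In the EXPLAIN output, the Map Operator Tree of Stage-6 contains six Group By Operators in hash mode under one TableScan, which suggests that each map task builds six partial-aggregation hash tables over the same rows. The following is a toy Python model of that situation, not Hive's implementation: the rows are made up, and the function only counts how many entries each table would hold inside a single task.

```python
from collections import defaultdict

# Grouping columns of the six INSERT branches (time window is always a key).
BRANCHES = [
    (),
    ("prov",),
    ("prov", "uid"),
    ("bucket", "prov", "uid"),
    ("bucket", "domain", "prov", "uid"),
    ("bucket", "city", "domain", "prov", "uid"),
]


def hash_table_entries(rows, branches):
    """Aggregate rows for every branch at once and return each table's size."""
    tables = [defaultdict(lambda: [0, 0]) for _ in branches]
    for row in rows:
        window = (row["time"] // 300000) * 300000
        for table, cols in zip(tables, branches):
            key = (window,) + tuple(row[c] for c in cols)
            acc = table[key]
            acc[0] += row["flow"]  # partial SUM(flow)
            acc[1] += row["hits"]  # partial SUM(hits)
    return [len(table) for table in tables]


# Hypothetical input split: 3000 synthetic rows, one per second.
rows = [
    {"time": t * 1000, "bucket": f"b{t % 7}", "city": f"c{t % 11}",
     "domain": f"d{t % 13}", "prov": f"p{t % 3}", "uid": t % 5,
     "flow": 1, "hits": 1}
    for t in range(3000)
]

if __name__ == "__main__":
    sizes = hash_table_entries(rows, BRANCHES)
    print("entries per branch:", sizes)
    print("held at once by a Multi Insert task:", sum(sizes))
    print("held by the largest single-INSERT task:", max(sizes))
```

In this model a Multi Insert task holds the sum of all six table sizes at once, while a task of a single-INSERT job holds only one of them. That is consistent with the observation that each INSERT succeeds on its own under the same heap setting.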
Run the following commands in the Hive CLI:

hive> set mapreduce.map.java.opts;
mapreduce.map.java.opts=-Xmx1500m

hive> set mapreduce.reduce.java.opts;
mapreduce.reduce.java.opts=-Xmx2048m

hive> set mapreduce.map.memory.mb;
mapreduce.map.memory.mb=2048

hive> set mapreduce.reduce.memory.mb;
mapreduce.reduce.memory.mb=3072

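Our job runs out of memory in the map phase, so the map heap is probably set too small (mapreduce.map.java.opts=-Xmx1500m, about 1.5 GB). So make it larger, but do not exceed the maximum a map task is allowed, mapreduce.map.memory.mb (2 GB here).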
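
The constraint above (the heap in mapreduce.map.java.opts must stay below mapreduce.map.memory.mb) can be checked mechanically. Here is a minimal Python sketch; the helper names parse_xmx_mb and heap_fits_container are made up for illustration. It requires the heap to be strictly smaller than the container, on the assumption that the JVM also needs some non-heap memory (thread stacks, native buffers) inside the same container.

```python
import re


def parse_xmx_mb(java_opts: str) -> int:
    """Return the -Xmx value found in a JVM option string, in megabytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        raise ValueError("no -Xmx option found")
    value, unit = int(match.group(1)), match.group(2).lower()
    if unit == "g":
        return value * 1024
    if unit == "m":
        return value
    if unit == "k":
        return value // 1024
    return value // (1024 * 1024)  # no suffix means bytes


def heap_fits_container(java_opts: str, container_mb: int) -> bool:
    """True if the configured heap leaves some headroom in the container."""
    return parse_xmx_mb(java_opts) < container_mb


if __name__ == "__main__":
    # Values taken from the Hive CLI output above.
    print(heap_fits_container("-Xmx1500m", 2048))
    print(heap_fits_container("-Xmx2048m", 3072))
```

How much headroom is enough depends on the workload; this sketch only rejects settings where the heap alone would already fill the container.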

Summary:

For an OOM caused by memory problems, there are two things to check:

Whether the program leaks memory
Whether the memory setting really is too small

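For the first, start by checking the program itself. In the case above, using Multi Insert left too little memory, so the JVM spent its time in GC. You may then ask: why is the error GC overhead limit exceeded rather than Java heap space?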

GC overhead limit exceeded explained:
I. Exception description:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

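II. Explanation:
An error type added in JDK 6. It is thrown when the GC spends a large amount of time but frees very little space. This usually means the heap is too small.
Cause of the exception: not enough memory.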

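III. Solutions:
1. Check whether the program contains code that uses a lot of memory, or an infinite loop.
2. Add the JVM startup option -XX:-UseGCOverheadLimit to turn this check off. It does not give the program more memory, so if the heap really is too small the job can still fail later with Java heap space.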
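
The rule in the explanation above can be written down as a small predicate. This Python sketch is a simplification and not the JVM's code: it assumes the documented HotSpot defaults of 98% time spent in GC and less than 2% of the heap recovered (the -XX:GCTimeLimit and -XX:GCHeapFreeLimit options), and it ignores that the real collector evaluates the condition over several collections. The function name is made up for illustration.

```python
def gc_overhead_limit_exceeded(gc_time_fraction: float,
                               heap_recovered_fraction: float,
                               time_limit: float = 0.98,
                               free_limit: float = 0.02) -> bool:
    """Simplified check: too much time in GC AND too little heap recovered."""
    return (gc_time_fraction > time_limit
            and heap_recovered_fraction < free_limit)


if __name__ == "__main__":
    # Almost all time in GC, almost nothing freed: the error condition.
    print(gc_overhead_limit_exceeded(0.99, 0.01))
    # Same GC time, but a useful amount of heap is freed: no error.
    print(gc_overhead_limit_exceeded(0.99, 0.10))
```

Both conditions must hold at once, which is why a slightly larger heap can make the error disappear: the collector starts recovering enough space per cycle again.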

So for this case, my tuning is as follows:

set mapreduce.map.java.opts=-Xmx1800m -XX:-UseGCOverheadLimit