With Spark Streaming, to keep data loss to a minimum, I manage the Kafka offsets myself,
committing them manually to ZooKeeper (zk):
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.streaming.kafka.MyKafkaRDD INFO:Computing topic datamining, partition 8 offsets 3883 -> 3903
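The error showed up once for this offset range, and by the time of the next error the offset had already advanced, which means the Kafka data in between was lost.
TODO: test whether Spark's built-in checkpointing has the same problem.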
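One way to reason about the problem is to make the commit strictly conditional on the batch succeeding. The following is only a minimal sketch of that idea, not the actual MyKafkaRDD code: `OffsetStore` is a hypothetical in-memory stand-in for the ZooKeeper offset node, `run_batch`, `fetch`, and `process` are illustrative names, and `ConnectionError` stands in for the `ClosedChannelException` seen in the log.

```python
from typing import Callable, Dict, List, Tuple


class OffsetStore:
    """Hypothetical in-memory stand-in for the offsets kept in ZooKeeper."""

    def __init__(self) -> None:
        self._offsets: Dict[Tuple[str, int], int] = {}

    def read(self, topic: str, partition: int) -> int:
        # Offsets default to 0 for a partition that was never committed.
        return self._offsets.get((topic, partition), 0)

    def commit(self, topic: str, partition: int, offset: int) -> None:
        self._offsets[(topic, partition)] = offset


def run_batch(
    store: OffsetStore,
    topic: str,
    partition: int,
    until_offset: int,
    fetch: Callable[[str, int, int, int], List[int]],
    process: Callable[[List[int]], None],
) -> bool:
    """Process [committed offset, until_offset) and commit only on success."""
    from_offset = store.read(topic, partition)
    try:
        records = fetch(topic, partition, from_offset, until_offset)
        process(records)
    except ConnectionError:
        # The fetch or the processing failed: leave the stored offset
        # untouched so that the next attempt re-reads the same range.
        return False
    # Reached only when the whole range was fetched and processed.
    store.commit(topic, partition, until_offset)
    return True
```

With this ordering a failed batch returns `False` and the stored offset stays where it was, so a retry covers the same range again. The cost is that records may be processed more than once (at-least-once delivery), so the downstream processing would need to tolerate duplicates.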
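Here is a log excerpt. While this data was being processed, the network connection dropped, which interrupted the consumer's fetch connection. The data was lost, yet the offset still advanced.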
2017-10-26 11:46:22 Executor task launch worker-1 kafka.utils.VerifiableProperties INFO:Property zookeeper.connect is overridden to
2017-10-26 11:46:22 task-result-getter-3 org.apache.spark.scheduler.TaskSetManager WARN:Lost task 2.0 in stage 494.0 (TID 1972, localhost): java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:98)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:130)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.fetchBatch(MyKafkaRDD.scala:192)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.getNext(MyKafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.executor.Executor ERROR:Exception in task 3.0 in stage 494.0 (TID 1973)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:98)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:130)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.fetchBatch(MyKafkaRDD.scala:192)
at org.apache.spark.streaming.kafka.MyKafkaRDD$KafkaRDDIterator.getNext(MyKafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2017-10-26 11:46:22 dispatcher-event-loop-3 org.apache.spark.scheduler.TaskSetManager INFO:Starting task 2.0 in stage 496.0 (TID 1977, localhost, partition 2,ANY, 2004 bytes)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.executor.Executor INFO:Running task 2.0 in stage 496.0 (TID 1977)
2017-10-26 11:46:22 Executor task launch worker-3 org.apache.spark.streaming.kafka.MyKafkaRDD INFO:Computing topic datamining, partition 8 offsets 3883 -> 3903
2017-10-26 11:46:22 Executor task launch worker-3 kafka.utils.VerifiableProperties INFO:Verifying properties