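Today our product manager came storming over: the real-time job in production is broken!!!! Newly added data never shows up.
I opened with the classic triple denial: 1. No way, it was fine this morning.
2. It can't be broken, it's probably network latency.
3. It can't be missing. Did you clear your cache? Try clearing your cache.
After sending the product manager away, I logged in right away to take a look, or else I'd be the one sacrificed...
1. Investigation
1.1 The YARN application was running fine, and no phone call or SMS had come in, so everything looked normal (PS: monitoring is in place here; if a real-time job dies it alerts by phone and SMS until your phone blows up...)
1.2 Check whether the JobManager had died. The Web UI opened normally, but, good grief, there was no job under Running Jobs. Looks like this one is on me; let me look around and see whether I can pin it on someone else...
[error: Internal server error.]
1.4 At the critical moment, it's still big brother YARN's log you can count on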
2020-08-20 09:22:52,687 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of TaskManager with id container_e14_1594608422123_4297_01_000002 timed out.
2020-08-20 09:22:52,688 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e14_1594608422123_4297_01_000002 because: The heartbeat of TaskManager with id container_e14_1594608422123_4297_01_000002 timed out.
2020-08-20 09:22:52,692 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> (Sink: Print to Std. Out, Map -> Filter) (1/3) (cbe68593452a4ede5106d642b57c5b4d) switched from RUNNING to FAILED.
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_e14_1594608422123_4297_01_000002 timed out.
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1125)
at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-08-20 09:22:52,695 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task e3dfc0d7e9ecd8a43f85f0b68ebf3b80_0.
2020-08-20 09:22:52,695 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 1 tasks should be restarted to recover the failed task e3dfc0d7e9ecd8a43f85f0b68ebf3b80_0.
2020-08-20 09:22:52,696 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Label2ES Streaming (3dc5e025bf4569cf32f17438317a13d1) switched from state RUNNING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:484)
at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1703)
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1252)
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1220)
at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:955)
at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:730)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:710)
at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:541)
at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:667)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:274)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_e14_1594608422123_4297_01_000002 timed out.
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1125)
at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
... 20 more
2020-08-20 09:22:52,698 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> (Sink: Print to Std. Out, Map -> Filter) (2/3) (0d127d02ced19432001f821da02cdc8c) switched from RUNNING to CANCELING.
2020-08-20 09:22:52,700 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> (Sink: Print to Std. Out, Map -> Filter) (3/3) (2486309ff7807d190e2131e2dde46a3d) switched from RUNNING to CANCELING.
2020-08-20 09:22:52,700 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution cbe68593452a4ede5106d642b57c5b4d.
2020-08-20 09:22:52,702 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> (Sink: Print to Std. Out, Map -> Filter) (2/3) (0d127d02ced19432001f821da02cdc8c) switched from CANCELING to CANCELED.
2020-08-20 09:22:52,702 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 0d127d02ced19432001f821da02cdc8c.
2020-08-20 09:22:52,702 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> (Sink: Print to Std. Out, Map -> Filter) (3/3) (2486309ff7807d190e2131e2dde46a3d) switched from CANCELING to CANCELED.
2020-08-20 09:22:52,702 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Discarding the results produced by task execution 2486309ff7807d190e2131e2dde46a3d.
2020-08-20 09:22:52,702 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Label2ES Streaming (3dc5e025bf4569cf32f17438317a13d1) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:186)
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:180)
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:484)
at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1703)
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1252)
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1220)
at org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:955)
at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
at org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
at org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.tryFailingAllocatedSlot(SlotPoolImpl.java:730)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.failAllocation(SlotPoolImpl.java:710)
at org.apache.flink.runtime.jobmaster.JobMaster.internalFailAllocation(JobMaster.java:541)
at org.apache.flink.runtime.jobmaster.JobMaster.notifyAllocationFailure(JobMaster.java:667)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:274)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_e14_1594608422123_4297_01_000002 timed out.
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(ResourceManager.java:1125)
at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
... 20 more
2020-08-20 09:22:52,703 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 3dc5e025bf4569cf32f17438317a13d1.
2020-08-20 09:22:52,703 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore - Shutting down
2020-08-20 09:22:52,706 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job 3dc5e025bf4569cf32f17438317a13d1 reached globally terminal state FAILED.
2020-08-20 09:22:52,707 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job Label2ES Streaming (3dc5e025bf4569cf32f17438317a13d1).
2020-08-20 09:22:52,708 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Suspending SlotPool.
2020-08-20 09:22:52,708 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 185c1b05023a28dabdc753bafaeea7b2: JobManager is shutting down..
2020-08-20 09:22:52,708 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Stopping SlotPool.
2020-08-20 09:22:52,708 INFO org.apache.flink.yarn.YarnResourceManager - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@10.0.0.98:32812/user/jobmanager_0 for job 3dc5e025bf4569cf32f17438317a13d1 from the resource manager.
2020-08-20 09:22:52,709 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl - JobManagerRunner already shutdown.
2020-08-20 09:24:17,137 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [/10.0.0.98:54188] failed with java.io.IOException: Connection reset by peer
2020-08-20 09:24:17,143 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.0.98:45661] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2020-08-20 09:24:17,143 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@10.0.0.98:35742] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2020-08-20 15:27:49,595 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerDetailsHandler - Unhandled exception.
org.apache.flink.runtime.resourcemanager.exceptions.UnknownTaskExecutorException: No TaskExecutor registered under container_e14_1594608422123_4297_01_000002.
at org.apache.flink.runtime.resourcemanager.ResourceManager.requestTaskManagerInfo(ResourceManager.java:532)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-08-20 15:27:52,605 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerDetailsHandler - Unhandled exception.
org.apache.flink.runtime.resourcemanager.exceptions.UnknownTaskExecutorException: No TaskExecutor registered under container_e14_1594608422123_4297_01_000002.
at org.apache.flink.runtime.resourcemanager.ResourceManager.requestTaskManagerInfo(ResourceManager.java:532)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
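The log is long, so let's go straight to the important part:
2. Locating the problem
2.1 The main problem is a timeout exception: the heartbeat from the TaskManager's container stopped arriving, so the container could no longer be reached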
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_e14_1594608422123_4297_01_000002 timed out.
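With a log this long, it can help to filter it down to the few lines that tell the failure story. The following is a minimal sketch, not a definitive tool: the function name `filter_failure_events` is hypothetical, and the patterns are taken from the messages that appear in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only heartbeat timeouts, job-level state
# transitions, and the restart-strategy verdict from a log read on stdin.
filter_failure_events() {
  grep -E 'heartbeat of TaskManager .* timed out|switched from state|Recovery is suppressed'
}

# Usage (the application id is a placeholder):
#   yarn logs -applicationId <application_id> | filter_failure_events
```

Note that `switched from state` only matches job-level transitions such as RUNNING to FAILING; the per-task lines use `switched from RUNNING to ...` and are deliberately left out to keep the output short.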
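2.2 Check whether it was caused by insufficient resources, network latency, and the like. It turned out there was only one container with one core, which puzzled me: why only one?
2.3 First, understand how task slots in Flink on YARN per-job mode map to containers and cores in YARN
Run this on the server to list the options (a separate blog post explains each parameter in detail):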
flink run --help
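The options that matter here are -ys and -p
Option | Description | Meaning |
-ys | Number of slots per TaskManager | How many slots each TaskManager has |
-p | The parallelism with which to run the program. Optional flag to override the default value specified in the configuration. | The parallelism of the job |
By default, the number of vcores per YARN container corresponds to the number of slots per TaskManager
Number of TaskManager containers in YARN = number of TaskManagers (think of them as worker nodes) = -p / -ys, rounded up; YARN also allocates one more container for the JobManager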
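The relationship above is just a ceiling division, which can be sketched as follows. The function name `taskmanager_count` is hypothetical; the three calls use the -p and -ys values that appear in the commands in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of TaskManager containers needed for a job,
# i.e. ceil(parallelism / slots_per_taskmanager).
taskmanager_count() {
  local parallelism="$1" slots_per_tm="$2"
  echo $(( (parallelism + slots_per_tm - 1) / slots_per_tm ))
}

taskmanager_count 30 3   # -ys 3 -p 30  -> prints 10
taskmanager_count 3 3    # -ys 3 -p 3   -> prints 1
taskmanager_count 20 4   # -ys 4 -p 20  -> prints 5
```

The JobManager container is not included in these numbers, so the YARN UI should show one more container than the value printed here.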
Test to verify:
Run the following command
flink run -m yarn-cluster -ynm mainname -ys 3 -p 30 -yjm 10240 -ytm 20480 -c com.group.mainclass xxx.jar
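The result is shown in the figure below:
3. Solving the problem
3.1 The original command used -ys 3 -p 3, so there was only one TaskManager; if that one dies, the job dies with it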
flink run -m yarn-cluster -ynm mainname -ys 3 -p 3 -yjm 1024 -ytm 2048 -c com.group.mainclass xxx.jar
3.2 The command was changed as follows (tune it to your own workload; bigger is not always better)
flink run -m yarn-cluster -ynm mainname -ys 4 -p 20 -yjm 10240 -ytm 20480 -c com.group.mainclass xxx.jar
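One caveat on the change above: the JobManager log also contains `Recovery is suppressed by NoRestartBackoffTimeStrategy`, which suggests no restart strategy was in effect, so the loss of any single TaskManager would still fail the whole job no matter how many TaskManagers there are. The sketch below only prints a submit command with a fixed-delay restart strategy added; it does not run Flink. The function name `submit_cmd` is hypothetical, and the `-yD` option and the `restart-strategy.*` keys are assumed to match the Flink version in use (they exist in Flink 1.10, which the stack traces above appear to come from), so check them against your version's documentation.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print (do not execute) the submit command
# with a fixed-delay restart strategy passed as dynamic properties.
# Assumption: the CLI accepts -yD <property=value> in yarn-cluster mode.
submit_cmd() {
  local attempts="$1" delay="$2"
  echo flink run -m yarn-cluster -ynm mainname -ys 4 -p 20 \
    -yjm 10240 -ytm 20480 \
    -yD restart-strategy=fixed-delay \
    -yD "restart-strategy.fixed-delay.attempts=${attempts}" \
    -yD "restart-strategy.fixed-delay.delay=${delay}" \
    -c com.group.mainclass xxx.jar
}

# Example values only: retry 3 times, waiting 10 seconds between attempts.
submit_cmd 3 10s
```

The attempt count and delay are illustrative; pick values that fit how long a replacement container usually takes to come up on your cluster.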
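With this the job was running normally again, but it still didn't answer the question in my title: why does the application still show as RUNNING in YARN when the Web UI already shows it as FAILED?
3.3 Look at the -d option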
-d,--detached | If present, runs the job in detached mode |
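But mine was already a YarnJobClusterEntrypoint from the start
Clutching at straws, I'll add it anyway. If there is no later update to this post, it worked; otherwise... there is no otherwise, let's leave it to fate.
4. Final command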
flink run -d -m yarn-cluster -ynm mainname -ys 4 -p 20 -yjm 10240 -ytm 20480 -c com.group.mainclass xxx.jar
2020-08-27 15:46:05------------------------------------------------------------------------------------------------------------------------
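OK, I'm back, and the hole above is finally filled in: I can confirm from my own testing that the -d option works. The job stopped again today and I went through a barrage of phone calls. Quite a delightful taste; I almost miss the days when the job had failed but YARN still said RUNNING.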
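Since the YARN application state alone was not a reliable signal here, a monitor could also ask Flink itself whether any job is running. The sketch below is one possible check, under the assumption that the JobManager's REST endpoint `/jobs/overview` is reachable and returns a JSON object whose `jobs` entries carry a `state` field; the function name `count_running_jobs` and the host and port placeholders are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count jobs reported as RUNNING in a
# /jobs/overview JSON response read from stdin.
count_running_jobs() {
  grep -oE '"state"[[:space:]]*:[[:space:]]*"RUNNING"' | wc -l | tr -d ' '
}

# Usage (host and port are placeholders):
#   curl -s "http://<jobmanager-host>:<port>/jobs/overview" | count_running_jobs
# A result of 0, or a failed connection, would both be reasons to alert.
```

This uses plain grep so that it has no dependencies beyond curl; a JSON-aware tool would be more robust if one is available on the monitoring host.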