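After the first task finished executing, the resource-release phase (which deletes all records from the local H2 database) failed with the following stack trace: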
2020-04-08 10:09:19 INFO - [ProcessorTracker-1586311659084] mission complete, ProcessorTracker already destroyed!
2020-04-08 10:09:19 ERROR - [TaskPersistenceService] deleteAllTasks failed, instanceId=1586311659084.
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at CommonUtils.executeWithRetry(CommonUtils.java:34)
at TaskPersistenceService.execute(TaskPersistenceService.java:297)
at TaskPersistenceService.deleteAllTasks(TaskPersistenceService.java:269)
at CommonTaskTracker.destroy(TaskTracker.java:231)
at CommonTaskTracker$StatusCheckRunnable.innerRun(TaskTracker.java:421)
at CommonTaskTracker$StatusCheckRunnable.run(TaskTracker.java:467)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-04-08 10:09:19 WARN - [TaskTracker-1586311659084] delete tasks from database failed.
2020-04-08 10:09:19 INFO - [TaskTracker-1586311659084] TaskTracker has left the world.
After that, the second task dispatched by the Server could not be created either. The exception stack trace:
2020-04-08 10:10:08 ERROR - [TaskPersistenceService] save taskTaskDO{taskId='0', jobId='1', instanceId='1586311804030', taskName='OMS_ROOT_TASK', address='10.37.129.2:2777', status=1, result='null', failedCnt=0, createdTime=1586311808295, lastModifiedTime=1586311808295} failed.
2020-04-08 10:10:08 ERROR - [TaskTracker-1586311804030] create root task failed.
[ERROR] [04/08/2020 10:10:08.511] [oms-akka.actor.internal-dispatcher-20] [akka://oms/user/task_tracker] create root task failed.
java.lang.RuntimeException: create root task failed.
at CommonTaskTracker.persistenceRootTask(TaskTracker.java:208)
at CommonTaskTracker.<init>(TaskTracker.java:81)
at TaskTrackerActor.lambda$onReceiveServerScheduleJobReq$2(TaskTrackerActor.java:138)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at TaskTrackerPool.atomicCreateTaskTracker(TaskTrackerPool.java:30)
at TaskTrackerActor.onReceiveServerScheduleJobReq(TaskTrackerActor.java:138)
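Cause and solution: destroy() calls scheduledPool.shutdownNow(), which forcibly shuts down the very thread pool whose thread is running destroy(). The destroy() call itself is therefore interrupted as well (hence the "sleep interrupted" above), the delete stops halfway, the database structure is corrupted, and the subsequent insert fails.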
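The interruption described above can be reproduced in isolation. The following is a minimal sketch, not the project's actual code: ShutdownOrderDemo and cleanupInterrupted are hypothetical names, and Thread.sleep stands in for the retrying database delete. It shows that calling shutdownNow() from a pool thread interrupts that same thread, and that running the cleanup before shutdownNow() is one possible ordering that avoids it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownOrderDemo {

    /**
     * Runs a simulated destroy() on a pool thread and reports whether the
     * cleanup step was interrupted.
     *
     * @param shutdownFirst true to call shutdownNow() before the cleanup
     *                      (the failing order), false to call it afterwards
     */
    static boolean cleanupInterrupted(boolean shutdownFirst) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean interrupted = new AtomicBoolean(false);
        CountDownLatch done = new CountDownLatch(1);

        pool.execute(() -> {
            try {
                if (shutdownFirst) {
                    // Interrupts every worker of the pool, including this thread.
                    pool.shutdownNow();
                }
                // Stand-in for the retrying delete; throws if the thread is interrupted.
                Thread.sleep(50);
                if (!shutdownFirst) {
                    // Cleanup is already finished, so the interrupt is harmless here.
                    pool.shutdownNow();
                }
            } catch (InterruptedException e) {
                interrupted.set(true);
            } finally {
                done.countDown();
            }
        });

        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdownNow(); // idempotent; makes sure the pool is gone in both cases
        return interrupted.get();
    }
}
```

With shutdownFirst set to true the cleanup is interrupted, matching the stack trace above; with false it completes before the pool is stopped.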
Cause: the SQL function now() returns a Datetime, which cannot be received into an int/bigint...
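Problem: java.lang.management.OperatingSystemMXBean#getSystemLoadAverage cannot always obtain the current CPU load; it may return a negative value to indicate that the load is unavailable... Solution: on Windows, getSystemLoadAverage() always returns -1... what a trap... Add a protective check for now and continue testing...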
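A protective check of this kind could look like the sketch below. LoadProbe, normalize and systemLoad are hypothetical names introduced for illustration; only ManagementFactory.getOperatingSystemMXBean() and getSystemLoadAverage() are real JDK calls. A negative reading is mapped to an empty value so that callers cannot mistake -1 for a real load.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.OptionalDouble;

class LoadProbe {

    /** Maps a raw reading to empty when it is negative, i.e. "not available". */
    static OptionalDouble normalize(double rawLoad) {
        return rawLoad < 0 ? OptionalDouble.empty() : OptionalDouble.of(rawLoad);
    }

    /** Reads the system load average, or empty on platforms that do not provide it. */
    static OptionalDouble systemLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return normalize(os.getSystemLoadAverage());
    }
}
```

A caller can then fall back to some other health metric, or skip load-based decisions entirely, when systemLoad() is empty.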
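Problem: on the fourth execution of a second-level (per-second) Broadcast task, when the Processor finishes and reports its status, the TaskTracker throws an error. The root cause is that the record for this task cannot be found in the database... Scenario: time expression FIX_DELAY; the corresponding TaskTracker is FrequentTaskTracker
Exception stack trace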
2020-04-16 18:05:09 ERROR - [TaskPersistenceService] getTaskStatus failed, instanceId=1586857062542,taskId=4.
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
at java.util.LinkedList.get(LinkedList.java:476)
at TaskPersistenceService.lambda$getTaskStatus$10(TaskPersistenceService.java:214)
at CommonUtils.executeWithRetry(CommonUtils.java:37)
at TaskPersistenceService.execute(TaskPersistenceService.java:310)
at TaskPersistenceService.getTaskStatus(TaskPersistenceService.java:212)
at TaskTracker.updateTaskStatus(TaskTracker.java:107)
at TaskTracker.broadcast(TaskTracker.java:214)
at TaskTrackerActor.onReceiveBroadcastTaskPreExecuteFinishedReq(TaskTrackerActor.java:106)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:187)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:186)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:241)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:242)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:242)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:242)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:242)
at akka.actor.Actor.aroundReceive(Actor.scala:534)
at akka.actor.Actor.aroundReceive$(Actor.scala:532)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:573)
at akka.actor.ActorCell.invoke(ActorCell.scala:543)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:269)
at akka.dispatch.Mailbox.run(Mailbox.scala:230)
at akka.dispatch.Mailbox.exec(Mailbox.scala:242)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
2020-04-16 18:05:09 WARN - [TaskTracker-1586857062542] query TaskStatus from DB failed when try to update new TaskStatus(taskId=4,newStatus=6).
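Solution: my initial suspicion is that, under consecutive updates, database locking makes the row temporarily invisible (I do not know the exact H2 behavior here). Updates for the same taskId must therefore be serialized -> synchronize. Yes!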
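One way to serialize updates per taskId is lock striping, sketched below. This is an illustration under stated assumptions, not the project's implementation: SerializedTaskUpdater, its in-memory map (standing in for the H2 table), the stripe count and the "status may only advance" rule are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SerializedTaskUpdater {

    private static final int STRIPES = 16; // assumption: a small fixed number of locks

    private final Object[] locks = new Object[STRIPES];
    // Stand-in for the task table: taskId -> status.
    private final Map<String, Integer> statusStore = new ConcurrentHashMap<>();

    SerializedTaskUpdater() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
        }
    }

    /** The same taskId always maps to the same lock object. */
    private Object lockFor(String taskId) {
        return locks[Math.floorMod(taskId.hashCode(), STRIPES)];
    }

    void createTask(String taskId, int initialStatus) {
        statusStore.put(taskId, initialStatus);
    }

    /**
     * Read-then-write under the per-taskId lock, so two updates of the same
     * task can never interleave. Returns false if the record is missing or
     * the new status would not advance the task (assumed rule).
     */
    boolean updateTaskStatus(String taskId, int newStatus) {
        synchronized (lockFor(taskId)) {
            Integer current = statusStore.get(taskId);
            if (current == null || newStatus <= current) {
                return false;
            }
            statusStore.put(taskId, newStatus);
            return true;
        }
    }

    Integer statusOf(String taskId) {
        return statusStore.get(taskId);
    }
}
```

Striping keeps updates of different tasks mostly parallel while bounding the number of lock objects; a single synchronized method would also work, at the cost of serializing every update.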
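Destructive test: specify a wrong processor -> found a problem: it causes a deadlock (the TaskTracker (TT) creates a ProcessorTracker (PT); PT creation fails, so the PT cannot report heartbeats periodically; the TT receives no heartbeat from the PT for a long time and considers the PT down (it really is down); with no available PT to redispatch the task to, a deadlock forms. Game over T_T). Fixed by making sure the ProcessorTracker can always be created: if the processor fails to build, every task submitted afterwards directly returns an error.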
StopInstance -> success
FetchInstanceStatus -> success
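The fix for the wrong-processor deadlock in the destructive test above (a tracker that is always created, and that fails every submitted task when the processor could not be built) can be sketched as follows. SafeProcessorTracker, the Supplier-based factory and the string results are hypothetical simplifications, not the project's real classes.

```java
import java.util.function.Function;
import java.util.function.Supplier;

class SafeProcessorTracker {

    private final Function<String, String> processor; // null when the build failed
    private final String buildError;                  // null when the build succeeded

    /** Never throws: a failed processor build is recorded instead of propagated. */
    SafeProcessorTracker(Supplier<Function<String, String>> processorFactory) {
        Function<String, String> built = null;
        String error = null;
        try {
            built = processorFactory.get();
            if (built == null) {
                error = "processor factory returned null";
            }
        } catch (RuntimeException e) {
            error = "processor build failed: " + e.getMessage();
        }
        this.processor = built;
        this.buildError = error;
    }

    boolean isHealthy() {
        return processor != null;
    }

    /** Runs the task, or returns the recorded build error immediately. */
    String submit(String taskInput) {
        if (processor == null) {
            return "FAILED: " + buildError;
        }
        return "SUCCESS: " + processor.apply(taskInput);
    }
}
```

Because the tracker object exists even when the processor is broken, it can keep answering (and, in the real system, keep reporting heartbeats), so the TaskTracker receives explicit failures instead of waiting on a tracker that never came up.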