We have a ThreadPoolExecutor-back task execution module. Some tasks are apparently finished in the application codes but the worker code somehow skip all the try/catch/finally and just go pick up the next task, causing the previous one to miss important status reporting code.
Like in this example, 257832:5 was picked up by thread-8 but this thread eventually just start another task.
2012-07-11 15:53:39,389 INFO [pool-1-thread-6]: Task (258861:5) : Starting. 2012-07-11 15:53:39,389 INFO [pool-1-thread-6]: A task running ### logged by app logic 2012-07-11 15:54:18,186 INFO [pool-1-thread-6]: Task (258868:5) : Starting. ### pick up another suddenly! 2012-07-11 15:54:18,186 INFO [pool-1-thread-6]: A task running ### logged by app logic 2012-07-11 15:54:18,445 INFO [pool-1-thread-6]: Task (258868:5) : returned from Task.doWork. 2012-07-11 15:54:18,446 INFO [pool-1-thread-6]: Task (258868:5) : finalizing. 2012-07-11 15:54:18,487 INFO [pool-1-thread-6]: Task (258868:5) : notifying status result: 200. 2012-07-11 15:54:18,487 INFO [pool-1-thread-6]: Task (258868:5) : Finished
The Runnable for the ThreadPoolExecutor) looks fine. It looks like
public void run() {
log.info(String.format("Task (%s:%s) : Starting.", wfId, taskId));
try {
Object result = task.doWork(...); // call the application codes
log.info(String.format("Task (%s:%s) : returned from Task.doWork.", wfId, taskId));
status = "DONE";
} catch (Throwable t) {
log.error("Error executing task: " + taskId + " - " + className);
} finally {
log.info(String.format("Task (%s:%s) : finalizing.", wfId, taskId));
// notify status to a server
log.info(String.format("Task (%s:%s) : Finished", wfId, taskId));
}
}
// the task framework looks like
// Use a queue size larger than numThreads, since the ThreadPoolExecutor#getActiveCount() call only returns an approximate count
this.executor = new ThreadPoolExecutor(numThreads, numThreads, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<Runnable>(numThreads * 2));
executor.execute(new TaskRunnable(...));
Note:
- we catch the ultimate Throwable but there is no exception logged between the two tasks
- there is no sign that exit() is called or program restart
- the first one skipped all the log lines, whether the one right after the call to the app code or the ones in the catch and finally blocks, and skip the status reporting codes
- this only happens randomly with low probability; but due to the large number of tasks executed, it still causes us a lot of headache.
It happens as if the thread pool executor simply evicts the running runnable magically (else it would require an InterruptedException which would be caught as a Throwable; Java threads are not to be stopped non-cooperatively except during a shutdown/exit) and skip all the blocks. I checked the ThreeadPoolExecutor javadoc, nothing should cause such event to happen.
What could have happened?