Using Storm 1.2.2 and Kafka 1.1.0.
After submitting the topology, the supervisor launches a worker process. Checking the worker.log file for that worker shows that, partway through loading the executors, the worker process gets killed by the supervisor.
Following are the supervisor logs:
{"@timestamp":"2020-01-09 11:18:57,719","message":"SLOT 6700: Assignment Changed from LocalAssignment(topology_id:trident-Topology-578320979, executors:[ExecutorInfo(task_start:22, task_end:22), ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:42, task_end:42), ExecutorInfo(task_start:18, task_end:18), ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:14, task_end:14), ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:38, task_end:38), ExecutorInfo(task_start:30, task_end:30), ExecutorInfo(task_start:34, task_end:34), ExecutorInfo(task_start:50, task_end:50), ExecutorInfo(task_start:46, task_end:46), ExecutorInfo(task_start:26, task_end:26), ExecutorInfo(task_start:39, task_end:39), ExecutorInfo(task_start:47, task_end:47), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:51, task_end:51), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:35, task_end:35), ExecutorInfo(task_start:31, task_end:31), ExecutorInfo(task_start:27, task_end:27), ExecutorInfo(task_start:43, task_end:43), ExecutorInfo(task_start:23, task_end:23), ExecutorInfo(task_start:11, task_end:11), ExecutorInfo(task_start:19, task_end:19), ExecutorInfo(task_start:15, task_end:15), ExecutorInfo(task_start:24, task_end:24), ExecutorInfo(task_start:12, task_end:12), ExecutorInfo(task_start:8, task_end:8), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:32, task_end:32), ExecutorInfo(task_start:40, task_end:40), ExecutorInfo(task_start:36, task_end:36), ExecutorInfo(task_start:28, task_end:28), ExecutorInfo(task_start:20, task_end:20), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:48, task_end:48), ExecutorInfo(task_start:44, task_end:44), ExecutorInfo(task_start:21, task_end:21), ExecutorInfo(task_start:33, task_end:33), ExecutorInfo(task_start:41, task_end:41), ExecutorInfo(task_start:37, task_end:37), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:9, task_end:9), ExecutorInfo(task_start:13, task_end:13), ExecutorInfo(task_start:17, task_end:17), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:29, task_end:29), ExecutorInfo(task_start:25, task_end:25), ExecutorInfo(task_start:45, task_end:45), ExecutorInfo(task_start:49, task_end:49)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:root) to null","thread_name":"SLOT_6700","level":"WARN"}
{"@timestamp":"2020-01-09 11:18:57,724","message":"Killing 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:00,808","message":"STATE RUNNING msInState: 120187 topo:trident-Topology-578320979 worker:5e19382e-c3e5-4c8d-8706-185e00e658a8 -> KILL msInState: 0 topo:trident-Topology-578320979 worker:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:00,809","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:00,828","message":"SLOT 6700 force kill and wait...","thread_name":"SLOT_6700","level":"WARN"}
{"@timestamp":"2020-01-09 11:19:00,831","message":"Force Killing 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:01,432","message":"Worker Process 5e19382e-c3e5-4c8d-8706-185e00e658a8 exited with code: 137","thread_name":"Thread-30","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,851","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,858","message":"SLOT 6700 all processes are dead...","thread_name":"SLOT_6700","level":"WARN"}
{"@timestamp":"2020-01-09 11:19:03,859","message":"Cleaning up 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,859","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,859","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/pids/3100","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,860","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/heartbeats","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,871","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/pids","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,872","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/tmp","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,872","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,873","message":"REMOVE worker-user 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,874","message":"Deleting path /data/workers-users/5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,876","message":"Removed Worker ID 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,876","message":"STATE KILL msInState: 3068 topo:trident-Topology-578320979 worker:null -> EMPTY msInState: 0","thread_name":"SLOT_6700","level":"INFO"}
After the worker with id 5e19382e-c3e5-4c8d-8706-185e00e658a8 was killed, the supervisor launched a new worker process with a different id. That worker starts loading the executors again, and once some of them have finished loading, it also receives a kill signal from the supervisor.
Following are the worker logs at port 6700:
...
2020-01-09 14:42:19.455 o.a.s.d.executor main [INFO] Loading executor b-14:[10 10]
2020-01-09 14:42:20.942 o.a.s.d.executor main [INFO] Loaded executor tasks b-14:[10 10]
2020-01-09 14:42:20.945 o.a.s.d.executor main [INFO] Finished loading executor b-14[10 10]
2020-01-09 14:42:20.962 o.a.s.d.executor main [INFO] Loading executor b-39:[37 37]
2020-01-09 14:42:22.547 o.a.s.d.executor main [INFO] Loaded executor tasks b-39:[37 37]
2020-01-09 14:42:22.549 o.a.s.d.executor main [INFO] Finished loading executor b-39:[37 37]
2020-01-09 14:42:22.566 o.a.s.d.executor main [INFO] Loading executor b-5:[46 46]
2020-01-09 14:42:25.267 o.a.s.d.executor main [INFO] Loaded executor tasks b-5:[46 46]
2020-01-09 14:42:25.269 o.a.s.d.executor main [INFO] Finished loading executor b-5:[46 46]
2020-01-09 14:42:31.175 o.a.s.d.executor main [INFO] Loading executor b-0:[4 4]
2020-01-09 14:42:37.512 o.s.c.n.e.InstanceInfoFactory Thread-10 [INFO] Setting initial instance status as: STARTING
2020-01-09 14:42:37.637 o.s.s.c.ThreadPoolTaskScheduler [Ljava.lang.String;@174cb0d8.container-0-C-1 [INFO] Shutting down ExecutorService
2020-01-09 14:42:37.851 o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer [Ljava.lang.String;@174cb0d8.container-0-C-1 [INFO] Consumer stopped
2020-01-09 14:42:37.855 o.s.i.k.i.KafkaMessageDrivenChannelAdapter Thread-10 [INFO] stopped org.springframework.integration.kafka.inbound.KafkaMessageDrivenChannelAdapter@2459333a
2020-01-09 14:42:37.870 o.s.s.c.ThreadPoolTaskScheduler [Ljava.lang.String;@6e355249.container-0-C-1 [INFO] Shutting down ExecutorService
2020-01-09 14:42:38.054 o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer [Ljava.lang.String;@6e355249.container-0-C-1 [INFO] Consumer stopped
After this, the log again starts with 'Launching worker for trident-Topology-578320979 ...' and the loading of all executors and tasks begins again.
Can anyone please explain what "Worker Process 5e19382e-c3e5-4c8d-8706-185e00e658a8 exited with code: 137" means?
The following link [https://issues.apache.org/jira/browse/STORM-2176] explains the configuration property supervisor.worker.shutdown.sleep.secs, which defaults to 1 second. It controls how long the supervisor waits for a worker to exit gracefully before forcibly killing it with kill -9; when that happens, the supervisor logs that the worker terminated with exit code 137 (128 + 9).
Would increasing the value of supervisor.worker.shutdown.sleep.secs help?
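If increasing it is the right direction, I assume it would go into storm.yaml on the supervisor nodes, roughly like this (the value of 5 seconds is only a placeholder I chose for illustration, not a recommended setting):

supervisor.worker.shutdown.sleep.secs: 5   # default is 1; how long the supervisor waits before kill -9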
Or could it be that the JVM doesn't have enough memory? But in that case I would expect an Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, and no such exception is visible in any of the logs.
Is it recommended to try increasing the JVM memory via the worker.childopts setting in storm.yaml?
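If raising the worker heap is the way to go, my understanding is that it would look roughly like this in storm.yaml (the -Xmx2g value is just a placeholder; the default, as far as I know, is -Xmx768m), followed by a supervisor restart:

worker.childopts: "-Xmx2g"   # JVM options passed to every worker process launched by this supervisor

Or, if it is preferable to change it only for this topology, via topology.worker.childopts.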
Any help would be greatly appreciated.
P.S. I have been trying to find a solution for a few days now, without success.