1

Using Storm 1.2.2

Kafka 1.1.0

After submitting topology, supervisor launches a worker processes. When checking the worker.log file for that launched Worker Process, it was found out that, somewhere between loading of all the executors, worker process gets killed by supervisor.

Following are the supervisor logs,

{"@timestamp":"2020-01-09 11:18:57,719","message":"SLOT 6700: Assignment Changed from LocalAssignment(topology_id:trident-Topology-578320979, executors:[ExecutorInfo(task_start:22, task_end:22), ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:42, task_end:42), ExecutorInfo(task_start:18, task_end:18), ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:14, task_end:14), ExecutorInfo(task_start:6, task_end:6), ExecutorInfo(task_start:38, task_end:38), ExecutorInfo(task_start:30, task_end:30), ExecutorInfo(task_start:34, task_end:34), ExecutorInfo(task_start:50, task_end:50), ExecutorInfo(task_start:46, task_end:46), ExecutorInfo(task_start:26, task_end:26), ExecutorInfo(task_start:39, task_end:39), ExecutorInfo(task_start:47, task_end:47), ExecutorInfo(task_start:7, task_end:7), ExecutorInfo(task_start:51, task_end:51), ExecutorInfo(task_start:3, task_end:3), ExecutorInfo(task_start:35, task_end:35), ExecutorInfo(task_start:31, task_end:31), ExecutorInfo(task_start:27, task_end:27), ExecutorInfo(task_start:43, task_end:43), ExecutorInfo(task_start:23, task_end:23), ExecutorInfo(task_start:11, task_end:11), ExecutorInfo(task_start:19, task_end:19), ExecutorInfo(task_start:15, task_end:15), ExecutorInfo(task_start:24, task_end:24), ExecutorInfo(task_start:12, task_end:12), ExecutorInfo(task_start:8, task_end:8), ExecutorInfo(task_start:4, task_end:4), ExecutorInfo(task_start:32, task_end:32), ExecutorInfo(task_start:40, task_end:40), ExecutorInfo(task_start:36, task_end:36), ExecutorInfo(task_start:28, task_end:28), ExecutorInfo(task_start:20, task_end:20), ExecutorInfo(task_start:16, task_end:16), ExecutorInfo(task_start:48, task_end:48), ExecutorInfo(task_start:44, task_end:44), ExecutorInfo(task_start:21, task_end:21), ExecutorInfo(task_start:33, task_end:33), ExecutorInfo(task_start:41, task_end:41), ExecutorInfo(task_start:37, task_end:37), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:9, task_end:9), ExecutorInfo(task_start:13, task_end:13), ExecutorInfo(task_start:17, task_end:17), ExecutorInfo(task_start:5, task_end:5), ExecutorInfo(task_start:29, task_end:29), ExecutorInfo(task_start:25, task_end:25), ExecutorInfo(task_start:45, task_end:45), ExecutorInfo(task_start:49, task_end:49)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:root) to null","thread_name":"SLOT_6700","level":"WARN"}

{"@timestamp":"2020-01-09 11:18:57,724","message":"Killing 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:00,808","message":"STATE RUNNING msInState: 120187 topo:trident-Topology-578320979 worker:5e19382e-c3e5-4c8d-8706-185e00e658a8 -> KILL msInState: 0 topo:trident-Topology-578320979 worker:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:00,809","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:00,828","message":"SLOT 6700 force kill and wait...","thread_name":"SLOT_6700","level":"WARN"}
{"@timestamp":"2020-01-09 11:19:00,831","message":"Force Killing 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:01,432","message":"Worker Process 5e19382e-c3e5-4c8d-8706-185e00e658a8 exited with code: 137","thread_name":"Thread-30","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,851","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,858","message":"SLOT 6700 all processes are dead...","thread_name":"SLOT_6700","level":"WARN"}
{"@timestamp":"2020-01-09 11:19:03,859","message":"Cleaning up 29a1f333-55f1-45c2-988d-daf0712c2862:5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,859","message":"GET worker-user for 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,859","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/pids/3100","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,860","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/heartbeats","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,871","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/pids","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,872","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8/tmp","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,872","message":"Deleting path /data/workers/5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,873","message":"REMOVE worker-user 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,874","message":"Deleting path /data/workers-users/5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,876","message":"Removed Worker ID 5e19382e-c3e5-4c8d-8706-185e00e658a8","thread_name":"SLOT_6700","level":"INFO"}
{"@timestamp":"2020-01-09 11:19:03,876","message":"STATE KILL msInState: 3068 topo:trident-Topology-578320979 worker:null -> EMPTY msInState: 0","thread_name":"SLOT_6700","level":"INFO"}

After this worker with id 5e19382e-c3e5-4c8d-8706-185e00e658a8 was killed, a new worker process was launched by supervisor with different Id and loading of executors starts again and then after some of the executors have done loading, the worker process will receive a kill signal from supervisor.

Following are the worker logs at port 6700,

...
2020-01-09 14:42:19.455 o.a.s.d.executor main [INFO] Loading executor b-14:[10 10]
2020-01-09 14:42:20.942 o.a.s.d.executor main [INFO] Loaded executor tasks b-14:[10 10]
2020-01-09 14:42:20.945 o.a.s.d.executor main [INFO] Finished loading executor b-14[10 10]
2020-01-09 14:42:20.962 o.a.s.d.executor main [INFO] Loading executor b-39:[37 37]
2020-01-09 14:42:22.547 o.a.s.d.executor main [INFO] Loaded executor tasks b-39:[37 37]
2020-01-09 14:42:22.549 o.a.s.d.executor main [INFO] Finished loading executor b-39:[37 37]
2020-01-09 14:42:22.566 o.a.s.d.executor main [INFO] Loading executor b-5:[46 46]
2020-01-09 14:42:25.267 o.a.s.d.executor main [INFO] Loaded executor tasks b-5:[46 46]
2020-01-09 14:42:25.269 o.a.s.d.executor main [INFO] Finished loading executor b-5:[46 46]
2020-01-09 14:42:31.175 o.a.s.d.executor main [INFO] Loading executor b-0:[4 4]
2020-01-09 14:42:37.512 o.s.c.n.e.InstanceInfoFactory Thread-10 [INFO] Setting initial instance status as: STARTING
2020-01-09 14:42:37.637 o.s.s.c.ThreadPoolTaskScheduler [Ljava.lang.String;@174cb0d8.container-0-C-1 [INFO] Shutting down ExecutorService
2020-01-09 14:42:37.851 o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer [Ljava.lang.String;@174cb0d8.container-0-C-1 [INFO] Consumer stopped
2020-01-09 14:42:37.855 o.s.i.k.i.KafkaMessageDrivenChannelAdapter Thread-10 [INFO] stopped org.springframework.integration.kafka.inbound.KafkaMessageDrivenChannelAdapter@2459333a
2020-01-09 14:42:37.870 o.s.s.c.ThreadPoolTaskScheduler [Ljava.lang.String;@6e355249.container-0-C-1 [INFO] Shutting down ExecutorService
2020-01-09 14:42:38.054 o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer [Ljava.lang.String;@6e355249.container-0-C-1 [INFO] Consumer stopped

After this, it will again start with 'Launching worker for trident-Topology-578320979 ...' and loading all the executors and tasks.

Can anyone please explain what does "Worker Process 5e19382e-c3e5-4c8d-8706-185e00e658a8 exited with code: 137" mean?

Following link [https://issues.apache.org/jira/browse/STORM-2176], explains that the configuration property supervisor.worker.shutdown.sleep.secs, which is set by default to 1 second. This corresponds to how long the supervisor will wait for a worker to exit gracefully before forcibly killing it with kill -9. When this happens the supervisor will log that the worker terminated with exit code 137 (128 + 9).

Would it help increasing the value of supervisor.worker.shutdown.sleep.secs?

Or Can it be because JVM doesn't have enough Memory? But then, It should throw Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, whereas in any of the logs, there is no such exception visible.

Is it recommended to try by increasing the JVM memory using the configuration settings ('worker.childopts') in storm.yaml.

Any help would be greatly appreciated.

P.S. Trying to find out solution since few days but no success.

DP63
  • 67
  • 1
  • 11
  • 1
    Anything in the worker logs? – Chris Gerken Jan 09 '20 at 14:30
  • 1
    Consider trying your topology with the latest version of Storm instead of an older version. Your logs are showing that the supervisor is killing the topology for some reason. The 137 code is simply saying that the supervisor decided to force kill instead of gracefully shutdown the worker, but the problem here is that the supervisor decides to shut the worker down at all. – Stig Rohde Døssing Jan 11 '20 at 18:33
  • @StigRohdeDøssing Tried updating storm to 2.1 and it works now.!!! But why does it has problem with lower version? It was working before few days and it stopped suddenly. – DP63 Jan 13 '20 at 12:14
  • Not sure, but there were some fixes to the supervisor in later versions of Storm, and would rather not spend time debugging an old version if it is fixed in the latest version. – Stig Rohde Døssing Jan 13 '20 at 16:04
  • I still face this behavior using Apache Storm v.2.2.0.. any suggestions? – moosehead42 Oct 24 '22 at 20:03

0 Answers0