
I am trying to test a Storm+Kafka+Trident job on a multi-node Storm cluster.

When I run my job on machine 1, it runs and records are processed. When I run the job after adding a second worker node, it still runs without any problems.

The problem starts when I add a third worker node to the cluster. I get the following in the worker log:

2014-07-16 16:47:56 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-cassandra1/10.201.221.139:6701... [29]
2014-07-16 16:47:56 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-cassandra1/10.201.221.139:6703... [30]
2014-07-16 16:47:57 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-cassandra1/10.201.221.139:6702... [30]
2014-07-16 16:47:57 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-cassandra1/10.201.221.139:6700... [29]
2014-07-16 16:47:57 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-cassandra1/10.201.221.139:6701... [30]
2014-07-16 16:47:57 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-cassandra1/10.201.221.139:6703
2014-07-16 16:47:57 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-cassandra1/10.201.221.139:6703..., timeout: 600000ms, pendings: 0
2014-07-16 16:47:58 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-cassandra1/10.201.221.139:6702
2014-07-16 16:47:58 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-cassandra1/10.201.221.139:6702..., timeout: 600000ms, pendings: 0
2014-07-16 16:47:58 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-cassandra1/10.201.221.139:6700... [30]
2014-07-16 16:48:31 s.k.KafkaUtils [INFO] Metrics Tick: Not enough data to calculate spout lag.
2014-07-16 16:48:34 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-172.144.96.66.static.eigbox.net/66.96.144.172:6701... [6]
2014-07-16 16:48:34 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-172.144.96.66.static.eigbox.net/66.96.144.172:6703... [6]

In the supervisor log I get the following messages:

2014-07-16 16:47:26 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started
2014-07-16 16:47:27 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started
2014-07-16 16:47:27 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started
2014-07-16 16:47:28 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started
2014-07-16 16:47:28 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started
2014-07-16 16:47:29 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started
2014-07-16 16:47:29 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started
2014-07-16 16:47:30 b.s.d.supervisor [INFO] 1fdb9a02-1110-458c-b72e-91950fbbc5fd still hasn't started

The job doesn't run at all. My storm.yaml config is as follows:

storm.zookeeper.servers:
- "10.201.32.79"
# 
nimbus.host: "10.201.32.79"
storm.local.dir: "/home/hadoop/stormtmp"
java.library.path: "/opt/java7/lib"
#supervisor.slots.ports:
#    - 6700
#    - 6701
#    - 6702
#    - 6703
worker.childopts: "-Xmx2048m -XX:NewSize=1000m -XX:MaxNewSize=1000m"
nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
supervisor.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
ui.port: 8084
ui.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"

1 Answer


It is basically saying the supervisors are not able to launch workers. Look in the supervisor log for a line that says something like
b.s.d.supervisor [INFO] Launching worker with command: java -server .....
Now copy that command and try to run it on your supervisor machine. If it fails, the error should tell you what needs to change in your storm.yaml.
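
As a minimal sketch of that check, assuming a default install layout where the supervisor writes to logs/supervisor.log under the Storm directory (adjust the paths for your own setup):

cd /opt/storm                                              # assumed install directory
grep "Launching worker with command" logs/supervisor.log | tail -1
# Copy the printed "java -server ..." command and run it by hand as the same
# user the supervisor runs as; whatever error it prints (a bad java.library.path,
# a heap size from worker.childopts the machine cannot allocate, etc.) points
# at the storm.yaml setting to fix.

If the worker JVM cannot start at all, the supervisor keeps logging the "still hasn't started" messages you are seeing.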

  • I was able to resolve it by upgrading to Storm 0.9.2. Storm 0.9.1 has a known bug (JIRA-187) that makes it malfunction like this, and it has been resolved in Storm 0.9.2. I also increased the Netty min wait to 4000 ms and the max wait to 10000 ms. This seems to have done the trick. Thanks anyway – subbu Jul 21 '14 at 09:25
  • storm.messaging.netty.max_retries=100 and storm.messaging.netty.max_wait_ms=1200000 solved the problem for me. Netty is very sensitive to the timeouts, and if they are not handled properly, it will crash the workers and the supervisor will restart them. – linehrr Jan 03 '18 at 23:02
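
For reference, a sketch of how the Netty settings mentioned in the comments above could be written in storm.yaml. The key names are the ones used by Storm's Netty transport (check your version's defaults.yaml); the values shown are those subbu reported, while linehrr instead used max_retries: 100 with max_wait_ms: 1200000, so treat them as starting points to tune for your own cluster:

storm.messaging.netty.min_wait_ms: 4000     # lower bound of the reconnect back-off
storm.messaging.netty.max_wait_ms: 10000    # upper bound of the reconnect back-off
storm.messaging.netty.max_retries: 100      # reconnect attempts before the client gives up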