
We have configured a Storm cluster with one Nimbus server and three supervisors, and published three topologies that perform different calculations, as follows:

  • Topology1: Reads raw data from MongoDB, does some calculations, and stores the result back.

  • Topology2: Reads the result of Topology1, does some calculations, and publishes the results to a queue.

  • Topology3: Consumes the output of Topology2 from the queue, calls a REST service, gets the reply from the REST service, updates the result in a MongoDB collection, and finally sends an email.

As a newbie to Storm, I am looking for expert advice on the following questions:

  1. Is there a way to externalize all configuration, for example into a config.json, that can be referred to by all topologies?

Currently the configuration for connecting to MongoDB, MySQL, MQ, and the REST URLs is hard-coded in Java files. It is not good practice to customize source files for each customer.

  2. I want to log at each stage [spouts and bolts]. Where should I post/store the log4j.xml so that it can be used by the cluster?

  3. Is it right to execute a blocking call, like a REST call, from a bolt?

Any help would be much appreciated.

Nageswara Rao

1 Answer

  1. Since each topology is just a Java program, simply pass the configuration into the Java Jar, or pass a path to a file. The topology can read the file at startup, and pass any configuration to components as it instantiates them.
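A minimal sketch of this approach, assuming a properties file with hypothetical keys like `mongo.uri` and `rest.url` (the file name and keys are illustrative, not anything Storm requires):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Properties;

// Load an external properties file once at topology-submission time,
// then hand the values to spouts/bolts via their constructors (fields of
// a component are serialized with the topology, so they reach the workers).
public class TopologyConfig {

    static Properties load(Path path) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(path)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real config file passed on the command line.
        Path cfg = Files.createTempFile("topology", ".properties");
        Files.write(cfg, Arrays.asList(
                "mongo.uri=mongodb://localhost:27017",
                "rest.url=http://localhost:8080/api"));

        Properties props = load(cfg);
        System.out.println(props.getProperty("mongo.uri"));
        // e.g. new MongoSpout(props.getProperty("mongo.uri")) when building
        // the topology; MongoSpout is a hypothetical component of yours.
    }
}
```

Alternatively, you can put the values into the topology `Config` map you pass to `StormSubmitter`; each bolt then reads them from the config map it receives in `prepare()`.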

  2. Storm uses slf4j out of the box, so it is easy to use within your topology. With the default configuration, you should be able to see logs either through the UI or dumped to disk on the worker nodes. If you can't find them, there are a number of guides to help, e.g. http://www.saurabhsaxena.net/how-to-find-storm-worker-log-directory/.
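Where the worker logging configuration lives depends on your Storm version: the 0.9.x/0.10.x line ships a logback configuration under the Storm installation directory, while 1.x uses log4j2 (a `worker.xml` under the installation's `log4j2` directory). Rather than bundling your own log4j.xml with the topology, you typically edit that file on each supervisor node. A minimal fragment, assuming log4j2 and a hypothetical `com.example` package for your spouts and bolts:

```xml
<!-- Fragment for the worker log4j2 config (Storm 1.x); the package name is an assumption -->
<Loggers>
  <Logger name="com.example" level="info"/>
</Loggers>
```

Inside the spouts and bolts themselves, just obtain a logger via slf4j's `LoggerFactory` and log normally.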

  3. With Storm, you have the flexibility to push concurrency out to the component level: you get multiple executors by setting a parallelism hint when you declare the bolt. This is likely the simplest approach, and I'd advise you to start there, and only later introduce the complexity of an executor inside your bolt for making HTTP calls asynchronously.

See http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html for the canonical overview of parallelism in Storm. As with anything, start simple and then tune as necessary.
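If you do later move the blocking REST calls off the bolt's executor thread, the core idea can be sketched with a plain `java.util.concurrent` thread pool; `callRestService` is a hypothetical stand-in for your HTTP client call, and in a real bolt you would ack the tuple only once the future completes:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: offload blocking calls to a bounded pool so the bolt's own
// thread keeps draining tuples instead of stalling on I/O.
public class AsyncRestSketch {

    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // Stand-in for the blocking REST call; replace with a real HTTP client.
    static String callRestService(String payload) {
        return "reply:" + payload;
    }

    // A bolt's execute() would submit here and ack later, rather than block.
    static Future<String> submit(String payload) {
        return POOL.submit(() -> callRestService(payload));
    }

    public static void main(String[] args) throws Exception {
        Future<String> f = submit("tuple-1");
        System.out.println(f.get()); // → reply:tuple-1
        POOL.shutdown();
    }
}
```

The bounded pool size matters: it caps how many REST calls are in flight from one worker, which plays the same throttling role as Storm's own max spout pending setting.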

Cory Dolphin
  • Thanks @Cory. A nice eye-opener for #1 and #2. I am a bit skeptical about #3 because, as I understand it, Storm is a platform for distributed computation, predominantly CPU-bound jobs. In use case #3, the spout is a simple MQ consumer and the bolts trigger REST calls with the data, which are IO-bound. #3 looks like it goes against Storm's strength (CPU-bound computation). Please clarify. – Nageswara Rao Mar 22 '16 at 06:24