2

I want to use Storm to compute the mean of incoming tuples of the form [int id, int value]. As you can see, I can't partition the data using a fields grouping. I need a topology architecture that distributes this computation, and the only way I can think of is doing mini-batches within each bolt instance and then aggregating.

I gather that Trident is the appropriate solution for mini-batch processing within Storm.

What is the best practice for computing global analytics with Storm, such as means, global counts, and standard deviations, when you can't partition the data based on an attribute? Any topology example?
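The mini-batch idea above boils down to each bolt instance keeping a running (count, sum, sum-of-squares) summary and periodically emitting it to a single downstream aggregator. Because merging these summaries is associative and commutative, the raw tuples can be shuffle-grouped freely. A minimal sketch of the summary type (class and method names are illustrative, not part of the Storm API):

```java
// Partial statistics kept by each bolt instance; mergeable downstream.
public class PartialStats {
    public long count;   // number of values seen so far
    public double sum;   // running sum of values
    public double sumSq; // running sum of squared values

    public void add(double value) {
        count++;
        sum += value;
        sumSq += value * value;
    }

    // Merging is associative and commutative, so a single aggregator
    // bolt can combine mini-batch summaries arriving in any order.
    public static PartialStats merge(PartialStats a, PartialStats b) {
        PartialStats out = new PartialStats();
        out.count = a.count + b.count;
        out.sum = a.sum + b.sum;
        out.sumSq = a.sumSq + b.sumSq;
        return out;
    }

    public double mean() {
        return sum / count;
    }

    public double stdDev() {
        double m = mean();
        // Population standard deviation: sqrt(E[x^2] - E[x]^2)
        return Math.sqrt(sumSq / count - m * m);
    }
}
```

With this, the topology is just: spout → shuffle-grouped worker bolts (each calling `add` and emitting its `PartialStats` every N tuples or seconds) → a single aggregator bolt calling `merge`.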

simon
  • It all depends on how you are going to group things to calculate the mean. Since a Storm topology is designed to handle a continuous stream of data, you must first decide how to group the data together to calculate the mean: over the life of the topology, a time window, something else? – Gordon Seidoh Worley Aug 09 '13 at 14:01
  • A very large time window, like a day, meaning that it must process millions of tuples. – simon Aug 11 '13 at 21:06

1 Answer

2

You can easily compute stream statistics such as mean, standard deviation, and count using Trident-ML. There's a section in the README that explains how to compute these stats within a Trident topology.
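Under the hood this is the combiner pattern Trident exposes through its `CombinerAggregator` interface (`init` / `combine` / `zero`). A self-contained sketch of that contract for a global mean, with the `Pair` type and `Mean` class as illustrative stand-ins rather than the real `storm.trident.operation.CombinerAggregator` signatures:

```java
// Sketch of the combiner contract Trident uses for global aggregates.
public class Mean {
    public static class Pair {
        public final long count;
        public final double sum;
        public Pair(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    // init: one incoming value becomes a (1, value) partial
    public static Pair init(double value) {
        return new Pair(1, value);
    }

    // combine: partials from any two batches or partitions merge
    public static Pair combine(Pair a, Pair b) {
        return new Pair(a.count + b.count, a.sum + b.sum);
    }

    // zero: identity element used for empty batches
    public static Pair zero() {
        return new Pair(0, 0.0);
    }

    public static double result(Pair p) {
        return p.sum / p.count;
    }
}
```

Because `combine` is associative with `zero` as identity, Trident can apply it per partition first and then merge the per-partition results, which is exactly what makes a global mean feasible without a fields grouping.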

Hope it helps.

Pierre Merienne