The logic of the simple word count program in storm-starter is fairly straightforward:
1) split each sentence into words
2) emit each word
3) aggregate the counts (stored in a map)
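
To make the three steps concrete, here is a minimal plain-Java sketch of the same logic. It deliberately uses no Storm classes; `WordCountSketch` and its method names are made up for illustration and are not the storm-starter code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the three word-count steps; no Storm classes are used.
class WordCountSketch {
    // Steps 1 and 2: split a sentence on whitespace; each element is one "emitted" word.
    static String[] split(String sentence) {
        return sentence.trim().split("\\s+");
    }

    // Step 3: aggregate the counts in an in-memory map.
    static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : split(sentence)) {
                // An empty sentence splits into one empty string; skip it.
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of(
                "the cow jumped over the moon",
                "the man went to the store")));
    }
}
```

In the real topology, steps 1 and 2 run in one bolt and step 3 in another, with the map living inside the counting bolt.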

However, I have three questions:
1) The program uses 12 separate threads for the aggregation step, so it seems the count is not global. Do we have to add one more layer to get the global count?
2) The bolt stores the counts in a map, which means it holds state. If the current worker fails, are all counts stored in the bolt lost, since Storm is stateless?
3) Should we use Trident for this instead?

Leo Li

1 Answer

Each of the 12 bolt tasks holds the counts for roughly 1/12th of the distinct words. The fields grouping sends every occurrence of a given word to the same task each time, so each word's count is already accurate globally and no extra aggregation layer is needed.

https://storm.apache.org/documentation/Concepts.html

Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
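
The effect of fields grouping can be sketched in plain Java. This is only a simulation under an assumption: hash-modulo routing stands in for whatever partitioning Storm actually applies, and `FieldsGroupingSketch`, `taskFor`, and `run` are hypothetical names, not Storm APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates 12 counting tasks that each keep a private map.
class FieldsGroupingSketch {
    static final int NUM_TASKS = 12;

    // Assumption: hash-mod routing stands in for Storm's fields grouping.
    // The same word always maps to the same task index.
    static int taskFor(String word) {
        return Math.floorMod(word.hashCode(), NUM_TASKS);
    }

    // Route every word to its task and count it in that task's own map.
    static List<Map<String, Integer>> run(List<String> words) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < NUM_TASKS; i++) {
            tasks.add(new HashMap<>());
        }
        for (String word : words) {
            tasks.get(taskFor(word)).merge(word, 1, Integer::sum);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> tasks = run(List.of("a", "b", "a", "c", "a"));
        // "a" lives in exactly one task's map, with its full count.
        System.out.println(tasks.get(taskFor("a")).get("a")); // prints 3
    }
}
```

No task ever sees a partial count for a word it owns, which is why the per-task maps together form the global result.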

Yes, the counts would be lost if the node crashed. Persistent storage should be used in accordance with your application's tolerance to inaccuracy and required performance characteristics.

Trident helps you build state with exactly-once processing semantics (counting, in this example). With regular Storm, if the backing map in the example were HBase, the counts would survive bolt crashes, but you would either lose data when the bolt restarted (best-effort processing) or overcount words if the sentence tuple was replayed (at-least-once processing). If you need to count each thing exactly once, Trident is the way to go.
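
One way to see how exactly-once counting can work is to store a batch id next to each count and skip any update whose id was already applied. The plain-Java sketch below illustrates that idea only; it is not Trident's API, the class and method names are hypothetical, and the in-memory map merely stands in for a persistent store such as HBase. It also assumes a replayed batch contains exactly the same tuples as the original.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of idempotent counting keyed by batch (transaction) id.
class TransactionalCountSketch {
    // Value stored per word: the count plus the id of the last batch applied.
    static final class Entry {
        final long count;
        final long txid;

        Entry(long count, long txid) {
            this.count = count;
            this.txid = txid;
        }
    }

    // Stand-in for persistent storage (an assumption for this sketch).
    private final Map<String, Entry> store = new HashMap<>();

    // Apply a batch's increment for a word unless that batch was already applied.
    void apply(long txid, String word, long increment) {
        Entry current = store.get(word);
        if (current != null && current.txid == txid) {
            return; // replayed batch: the stored count already includes it
        }
        long base = (current == null) ? 0 : current.count;
        store.put(word, new Entry(base + increment, txid));
    }

    long count(String word) {
        Entry current = store.get(word);
        return current == null ? 0 : current.count;
    }

    public static void main(String[] args) {
        TransactionalCountSketch state = new TransactionalCountSketch();
        state.apply(1, "storm", 2);
        state.apply(1, "storm", 2); // replay of batch 1 is ignored
        state.apply(2, "storm", 1);
        System.out.println(state.count("storm")); // prints 3
    }
}
```

Without the stored id, the replay of batch 1 would be counted twice, which is the overcounting case described above.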

Joshua Martell
  • Hi Joshua, thanks for the answer. I understand the first point: the same word always goes to the same partition, which gives a global count. For the second question, **"If the backing map in the example was HBase, it would be resilient to bolt crashes, but you would either lose data when the bolt restarted (best effort processing), or over count words if the sentence tuple was replayed (at least once processing)."** Do you mean we still get data loss or overcounting even if we use Trident backed by HBase? – Leo Li Apr 14 '15 at 20:41
  • I figured it out myself: it depends on which kind of map state you use (transactional, non-transactional, or opaque), which is basically a trade-off between fault tolerance and storage cost. – Leo Li Apr 15 '15 at 01:16
  • In that case I was talking about regular Storm and how under- or overcounting could happen. Trident's opaque and transactional map states provide exactly-once counting with HBase (or other persistent storage). Which one you should use depends on whether your spout can replay exactly the same set of messages after a failure. – Joshua Martell Apr 15 '15 at 13:48