The "problem" is a mismatch between "get the count of distinct words over all tweets" and Storm as a stream processor. The query you want to answer can only be computed on a finite set of Tweets. However, in stream processing you process a potentially infinite stream of input data.
If you have a finite set of Tweets, you might want to use a batch processing framework such as Flink, Spark, or MapReduce. If you indeed have an infinite number of Tweets, you must rephrase your question...
As you mentioned already, you actually want to "loop over all Tweets". As you are doing stream processing, there is no such concept. You have an infinite number of input tuples, and Storm applies execute() to each of them (i.e., you can think of it as if Storm "loops over the input" automatically, even if "looping" is not the correct term for it). As your computation is "over all Tweets", you need to maintain state in your Bolt code so that you can update this state for each Tweet. The simplest form of state in Storm is a member variable in your Bolt class.
```java
// extending Trident's BaseFunction matches the execute() signature used
// below (which class to extend depends on your setup -- an assumption here)
public class MyBolt extends BaseFunction {
    // this is your "state" variable
    private final Set<String> allWords = new HashSet<String>();

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        Tweet tweet = (Tweet) tuple.getValue(0);
        String tweetBody = tweet.getBody();
        // regex is a placeholder for your word-splitting pattern
        String[] words = tweetBody.toLowerCase().split(regex);
        for (String w : words) {
            // as allWords is a set, you cannot add the same word twice;
            // the second add() call on the same word is simply ignored,
            // thus allWords contains each word exactly once
            this.allWords.add(w);
        }
    }
}
```
Right now, this code does not emit anything, because it is unclear what you actually want to emit. As there is no end in stream processing, you cannot say "emit the final count of words contained in allWords". What you could do is emit the current count after each update... For this, add collector.emit(new Values(this.allWords.size())); at the end of execute().
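To illustrate the effect of emitting after each update, here is a small Storm-free sketch (the class, its onWord() method, and the List of emitted values are stand-ins I made up to mimic execute() and collector.emit()): each incoming word updates the state and then "emits" the current number of distinct words seen so far.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal stand-in (no Storm dependencies) for a Bolt that emits the
// running distinct-word count after every state update.
public class RunningDistinctCount {
    private final Set<String> allWords = new HashSet<String>();
    private final List<Integer> emitted = new ArrayList<Integer>();

    // plays the role of execute(): update the state, then emit the current size
    public void onWord(String w) {
        this.allWords.add(w);
        this.emitted.add(this.allWords.size()); // collector.emit(...) stand-in
    }

    public List<Integer> getEmitted() {
        return emitted;
    }

    public static void main(String[] args) {
        RunningDistinctCount bolt = new RunningDistinctCount();
        for (String w : new String[] {"storm", "stream", "storm", "state"}) {
            bolt.onWord(w);
        }
        // the duplicate "storm" does not increase the count
        System.out.println(bolt.getEmitted()); // [1, 2, 2, 3]
    }
}
```

Note that the emitted value only changes when a previously unseen word arrives; downstream consumers see a monotonically non-decreasing running count, never a "final" one.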
Furthermore, I want to add that the presented solution only works correctly if no parallelism is applied to MyBolt -- otherwise, the sets of the different instances might contain the same word. To resolve this, you would tokenize each Tweet into its words in a stateless Bolt and feed this stream of words into an adapted MyBolt that uses an internal Set as state. MyBolt must also receive its input via fieldsGrouping to ensure distinct sets of words on each instance.
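To see why fieldsGrouping matters, here is a small Storm-free simulation (the class and hash-based routing are my stand-ins, not Storm API): routing each word by its hash, as fieldsGrouping on the word field does, sends all occurrences of one word to the same instance, so the per-instance sets are disjoint and their sizes simply add up to the global distinct count.

```java
import java.util.HashSet;
import java.util.Set;

// Simulates several parallel MyBolt instances. Routing each word by
// hash(word) % parallelism mimics fieldsGrouping on the word field:
// all copies of a given word land on the same instance, so the
// instance-local sets are disjoint and their sizes can be summed.
public class FieldsGroupingSim {
    @SuppressWarnings("unchecked")
    public static int distinctCount(String[] words, int parallelism) {
        Set<String>[] instances = new Set[parallelism];
        for (int i = 0; i < parallelism; i++) {
            instances[i] = new HashSet<String>();
        }
        for (String w : words) {
            // "fieldsGrouping": the same word always picks the same task
            int task = Math.abs(w.hashCode() % parallelism);
            instances[task].add(w);
        }
        int total = 0;
        for (Set<String> s : instances) {
            total += s.size(); // disjoint sets, so the sizes add up
        }
        return total;
    }

    public static void main(String[] args) {
        String[] words = {"storm", "stream", "storm", "state", "stream"};
        // same global result regardless of parallelism
        System.out.println(distinctCount(words, 2)); // 3
        System.out.println(distinctCount(words, 1)); // 3
    }
}
```

With shuffleGrouping instead, two occurrences of the same word could land on different instances, each set would count it once, and the summed sizes would overcount.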