The "problem" is a mismatch between "get the count of distinct words over all tweets" and Storm as a stream processor. The query you want to answer can only be computed on a finite set of Tweets. However, in stream processing you process a potentially infinite stream of input data.
If you have a finite set of Tweets, you might want to use a batch processing framework such as Flink, Spark, or MapReduce. If you indeed have an infinite number of Tweets, you must rephrase your question...
As you mentioned already, you actually want to "loop over all Tweets". As you are doing stream processing, there is no such concept. You have an infinite number of input tuples, and Storm applies execute() to each of them (i.e., you can think of it as if Storm "loops over the input" automatically, even if "looping" is not the correct term for it). As your computation is "over all Tweets", you need to maintain state in your Bolt code so that you can update this state for each Tweet. The simplest form of state in Storm is a member variable in your Bolt class.
```java
// extending Trident's BaseFunction matches the execute() signature used
// below (which class to extend depends on your setup -- an assumption here)
public class MyBolt extends BaseFunction {
    // this is your "state" variable
    private final Set<String> allWords = new HashSet<String>();

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        Tweet tweet = (Tweet) tuple.getValue(0);
        String tweetBody = tweet.getBody();
        // regex is a placeholder for your word-splitting pattern
        String[] words = tweetBody.toLowerCase().split(regex);
        for (String w : words) {
            // as allWords is a set, you cannot add the same word twice;
            // the second add() call on the same word is simply ignored,
            // thus allWords contains each word exactly once
            this.allWords.add(w);
        }
    }
}
```
Right now, this code does not emit anything, because it is unclear what you actually want to emit. As there is no end in stream processing, you cannot say "emit the final count of words contained in allWords". What you could do is emit the current count after each update... For this, add collector.emit(new Values(this.allWords.size())); at the end of execute().
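To illustrate the effect of emitting after each update, here is a small Storm-free sketch (the class, its onWord() method, and the List of emitted values are stand-ins I made up to mimic execute() and collector.emit()): each incoming word updates the state and then "emits" the current number of distinct words seen so far.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal stand-in (no Storm dependencies) for a Bolt that emits the
// running distinct-word count after every state update.
public class RunningDistinctCount {
    private final Set<String> allWords = new HashSet<String>();
    private final List<Integer> emitted = new ArrayList<Integer>();

    // plays the role of execute(): update the state, then emit the current size
    public void onWord(String w) {
        this.allWords.add(w);
        this.emitted.add(this.allWords.size()); // collector.emit(...) stand-in
    }

    public List<Integer> getEmitted() {
        return emitted;
    }

    public static void main(String[] args) {
        RunningDistinctCount bolt = new RunningDistinctCount();
        for (String w : new String[] {"storm", "stream", "storm", "state"}) {
            bolt.onWord(w);
        }
        // the duplicate "storm" does not increase the count
        System.out.println(bolt.getEmitted()); // [1, 2, 2, 3]
    }
}
```

Note that the emitted value only changes when a previously unseen word arrives; downstream consumers see a monotonically non-decreasing running count, never a "final" one.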
Furthermore, I want to add that the presented solution only works correctly if no parallelism is applied to MyBolt -- otherwise, the sets of the different instances might contain the same word. To resolve this, you would tokenize each Tweet into its words in a stateless Bolt and feed this stream of words into an adapted MyBolt that uses an internal Set as state. MyBolt must also receive its input via fieldsGrouping to ensure distinct sets of words on each instance.
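To see why fieldsGrouping matters, here is a small Storm-free simulation (the class and hash-based routing are my stand-ins, not Storm API): routing each word by its hash, as fieldsGrouping on the word field does, sends all occurrences of one word to the same instance, so the per-instance sets are disjoint and their sizes simply add up to the global distinct count.

```java
import java.util.HashSet;
import java.util.Set;

// Simulates several parallel MyBolt instances. Routing each word by
// hash(word) % parallelism mimics fieldsGrouping on the word field:
// all copies of a given word land on the same instance, so the
// instance-local sets are disjoint and their sizes can be summed.
public class FieldsGroupingSim {
    @SuppressWarnings("unchecked")
    public static int distinctCount(String[] words, int parallelism) {
        Set<String>[] instances = new Set[parallelism];
        for (int i = 0; i < parallelism; i++) {
            instances[i] = new HashSet<String>();
        }
        for (String w : words) {
            // "fieldsGrouping": the same word always picks the same task
            int task = Math.abs(w.hashCode() % parallelism);
            instances[task].add(w);
        }
        int total = 0;
        for (Set<String> s : instances) {
            total += s.size(); // disjoint sets, so the sizes add up
        }
        return total;
    }

    public static void main(String[] args) {
        String[] words = {"storm", "stream", "storm", "state", "stream"};
        // same global result regardless of parallelism
        System.out.println(distinctCount(words, 2)); // 3
        System.out.println(distinctCount(words, 1)); // 3
    }
}
```

With shuffleGrouping instead, two occurrences of the same word could land on different instances, each set would count it once, and the summed sizes would overcount.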