6

I am new to Spark and Spark Streaming, and I am working on Twitter streaming data. My task involves dealing with each tweet independently, e.g. counting the number of words in each tweet. From what I have read, each input batch forms one RDD in Spark Streaming. So if I set a batch interval of 2 seconds, the new RDD contains all the tweets from those two seconds, any transformation applies to the whole two seconds of data, and there is no way to deal with individual tweets within that window. Is my understanding correct, or does each tweet form a new RDD? I am kind of confused...

Naren
  • 457
  • 2
  • 10
  • 19

1 Answer

2

In one batch you have an RDD containing all the statuses that arrived in the 2-second interval. You can then process these statuses individually. Here is a brief example:

    JavaDStream<Status> inputDStream = TwitterUtils.createStream(ctx, new OAuthAuthorization(builder.build()), filters);

    inputDStream.foreach(new Function2<JavaRDD<Status>, Time, Void>() {
        @Override
        public Void call(JavaRDD<Status> status, Time time) throws Exception {
            // collect() brings the whole batch to the driver; fine for a small
            // demo, but for real workloads process the RDD with Spark
            // transformations instead of collecting it.
            List<Status> statuses = status.collect();
            for (Status st : statuses) {
                System.out.println("STATUS:" + st.getText() + " user:" + st.getUser().getId());
                // Process and store each status somewhere
            }
            return null;
        }
    });
    ctx.start();
    ctx.awaitTermination();
}
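
For the word-count-per-tweet case from the question, the processing step inside the loop could be a plain helper like the sketch below. Whitespace tokenization is an assumption, and the class and method names are made up for illustration; in the streaming job you would call it on `st.getText()`:

```java
import java.util.Arrays;
import java.util.List;

public class TweetWordCount {
    // Count the words in a single tweet by splitting on runs of whitespace.
    // Blank or empty tweets count as zero words.
    static int wordCount(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("hello spark streaming", "  one  ", "");
        for (String t : tweets) {
            System.out.println(wordCount(t) + " words in: '" + t + "'");
        }
    }
}
```

Because each tweet is one element of the batch RDD, a per-element transformation like this is exactly what a `map` over the RDD applies, so individual tweets are not lost inside the batch.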

I hope I didn't misunderstand your question.

Zoran

zoran jeremic
  • 2,046
  • 4
  • 21
  • 47
  • Thank you. If I store statuses individually in a list, can I apply all the RDD transformations and actions like reduceByKey() or countByValue() on the list? Though I am new to Scala, I need to do this in Scala. – Naren Jun 28 '15 at 16:51
  • I just gave you an example with a list to show that you can access individual statuses, but if you want to use Spark to process them further, you should not collect the statuses into a list. For example, you can apply inputDStream.mapToPair to key the statuses by something, e.g. user id or whatever you need, and then reduceByKey. Unfortunately, I have only basic knowledge of Scala and can't give you an example, but everything you can do in Java, you can do in Scala as well. – zoran jeremic Jun 28 '15 at 20:27
  • I thought maybe I could store the statuses of a specific batch in a list and convert that list into an RDD using parallelize() so that I can apply Spark transformations and actions. – Naren Jun 28 '15 at 21:47
  • If you look at the previous example, you will notice that the statuses arriving in one batch are already an RDD, so there is no need to convert them to a list like I did in the example and then parallelize again. If you need a Spark transformation over the statuses in one batch, you can do it directly on the status RDD inside the function's call method. So if you want to reduceByKey, you should first do PairRDD pairs = status.mapToPair(...), and then pairs.reduceByKey(...). – zoran jeremic Jun 28 '15 at 22:19
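
What the last comment describes — map each status to a (key, value) pair, then reduce by key — has the same shape as this plain-Java sketch. It has no Spark dependency so it runs standalone, and the user ids and the class name are made up for illustration; in the real job the pairing would come from status.mapToPair and the summing from pairs.reduceByKey:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceByKeySketch {
    // reduceByKey step: sum the values for each key.
    // Each long[] is a (key, value) pair, e.g. (userId, 1).
    static Map<Long, Long> reduceByKey(List<long[]> pairs) {
        Map<Long, Long> reduced = new HashMap<>();
        for (long[] pair : pairs) {
            reduced.merge(pair[0], pair[1], Long::sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        // mapToPair step, simulated: one (userId, 1) pair per status in the batch.
        List<long[]> statusPairs = Arrays.asList(
                new long[]{42L, 1L},
                new long[]{7L, 1L},
                new long[]{42L, 1L});
        // User 42 appears twice in the batch, user 7 once.
        System.out.println(reduceByKey(statusPairs));
    }
}
```

The point of the sketch is the two-step shape: first turn each element into a keyed pair, then aggregate per key, which Spark does in a distributed way without ever collecting the batch to the driver.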