
I have a Spark Streaming program running in local mode in which I receive JSON messages from a TCP socket connection, several per batch interval.

Each of these messages has an ID, which I use as the key to build a key/value JavaPairDStream, so that each partition of the RDD inside my DStream holds a single key/value pair, i.e. one message per partition.

My goal now is to group the messages that have the same ID into the same partition, so that I can map over them in parallel, with each partition processed by a different core.

Following is my code:

import java.util.ArrayList;
import java.util.Iterator;

import com.google.gson.Gson;

import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;

import scala.Tuple2;

// Receive the raw JSON strings from the TCP socket.
JavaReceiverInputDStream<String> streamData2 = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
        StorageLevels.MEMORY_AND_DISK_SER);

// Collapse each batch into a single partition.
JavaDStream<String> streamData1 = streamData2.repartition(1);

// Turn every JSON message into a (sid, message) pair.
JavaPairDStream<String, String> streamGiveKey = streamData1.mapPartitionsToPair(
        new PairFlatMapFunction<Iterator<String>, String, String>() {
    @Override
    public Iterable<Tuple2<String, String>> call(Iterator<String> stringIterator) throws Exception {

        ArrayList<Tuple2<String, String>> a = new ArrayList<Tuple2<String, String>>();

        while (stringIterator.hasNext()) {
            String c = stringIterator.next();
            // Skip null messages; returning null here would break the flatMap downstream.
            if (c == null) {
                continue;
            }

            JsonMessage retMap = new Gson().fromJson(c, JsonMessage.class);
            String key = retMap.getSid();
            a.add(new Tuple2<String, String>(key, c));
        }

        return a;
    }
});

So, at the end of this code, I have a DStream whose RDD has only one partition (because of the repartition(1)), with all of the key/value pairs inside it.

How should I proceed now to group the messages that have the same key and put them into different partitions, so that I can map over them separately?

  • Does my previous answer to the other question cover this http://stackoverflow.com/questions/37908890/how-to-group-key-values-by-partition-in-spark ? If not, please let us know what more you need here. – WestCoastProjects Jun 19 '16 at 16:59
  • Yes, it does, thank you so much. I actually have another question regarding this, but I will create a new one. Thank you. – manuel mourato Jun 19 '16 at 17:08

1 Answer


This question was - according to the OP - addressed by a simultaneous answer to another question here: How to group key/values by partition in Spark?

– WestCoastProjects
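For readers landing here, this is a minimal sketch of the general idea behind the linked answer (not that answer's exact code), assuming the streamGiveKey pair stream from the question; the partition count of 4 and the per-key counting step are arbitrary illustrations:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// streamGiveKey is the JavaPairDStream<String, String> built in the question.
// A HashPartitioner sends all pairs with the same key to the same partition;
// the partition count (4 here, an arbitrary choice) sets the parallelism.
JavaPairDStream<String, Iterable<String>> groupedByKey =
        streamGiveKey.groupByKey(new HashPartitioner(4));

// Each Iterable<String> now holds every message with the same ID, and different
// keys can be processed on different cores. The counting below is only an
// illustrative per-key operation.
JavaPairDStream<String, Integer> messageCountPerId =
        groupedByKey.mapValues(new Function<Iterable<String>, Integer>() {
            @Override
            public Integer call(Iterable<String> messages) {
                int count = 0;
                for (String ignored : messages) {
                    count++;
                }
                return count;
            }
        });

Because the grouping happens per batch, each resulting RDD has at most 4 partitions here, and all messages sharing an ID are guaranteed to sit in the same one.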