
My Spark Streaming application (Spark 2.0, on an AWS EMR YARN cluster) listens to campaigns based on live stock feeds, with a batch duration of 5 seconds. The application uses a Kafka DirectStream, and based on the feed source there are three streams. As shown in the code snippet below, I union the three streams and try to remove duplicate campaigns with reduceByKey, keyed on customer and campaignId. However, I can see a lot of duplicate emails being sent out for the same key within the same batch. I was expecting reduceByKey to remove the duplicate campaigns in a batch based on customer and campaignId. In the logs I even print the key and batch time before sending the email, and I can clearly see the duplicates. After adding a log statement inside the reduceByKey function I can see some duplicates being removed, but it does not eliminate them completely.

JavaDStream<Campaign> matchedCampaigns = stream1.transform(CmpManager::getMatchedCampaigns)
        .union(stream2).union(stream3).cache();

// Key each campaign by customer + campaignId, then keep a single campaign per key in the batch
JavaPairDStream<String, Campaign> uniqueCampaigns = matchedCampaigns
        .mapToPair(campaign -> {
            String key = campaign.getCustomer() + "_" + campaign.getId();
            return new Tuple2<String, Campaign>(key, campaign);
        })
        .reduceByKey((campaign1, campaign2) -> campaign1);

uniqueCampaigns.foreachRDD(CmpManager::sendEmail);
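
For completeness, the foreachRDD consumer is shaped roughly like this (a hypothetical sketch: only the logging of the key and batch time before each send reflects what the real CmpManager.sendEmail does; LOG and EmailService below are placeholders):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.Time;
import scala.Tuple2;

// Hypothetical sketch of the consumer passed to foreachRDD
public static void sendEmail(JavaPairRDD<String, Campaign> rdd, Time batchTime) {
    rdd.foreachPartition(records -> {
        while (records.hasNext()) {
            Tuple2<String, Campaign> record = records.next();
            // The key and batch time are logged before every email; duplicates show up here
            LOG.info("Sending email for key {} at batch time {}", record._1(), batchTime);
            EmailService.send(record._2()); // placeholder for the actual email call
        }
    });
}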

I am not able to figure out where I am going wrong here; please help me get rid of this weird problem. Previously we were using createStream to listen to the Kafka queue (with 1 partition), and we didn't face this issue there. But when we moved to directStream (with 100 partitions), we could easily reproduce this issue under high load.
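
For context, the direct streams are created roughly like this (a simplified sketch using the Kafka 0.10 direct stream API; the broker address, topic name, and group id below are placeholders rather than our actual configuration, and the mapping from Kafka records to Campaign objects is omitted):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "broker1:9092");           // placeholder
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "campaign-consumers");              // placeholder

// One direct stream per feed source; the underlying topic has 100 partitions
JavaInputDStream<ConsumerRecord<String, String>> kafkaStream1 =
        KafkaUtils.createDirectStream(
                jssc,                                            // JavaStreamingContext with 5s batches
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("feed1"), kafkaParams));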

Note: I even tried reduceByKeyAndWindow with a window duration of 5 seconds instead of the reduceByKey operation, but even that didn't help.
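
In Java the windowed variant looked roughly like this (a sketch; the stream is keyed by customer and campaignId exactly as in the snippet above, with window and slide equal to the 5-second batch interval):

import org.apache.spark.streaming.Durations;

JavaPairDStream<String, Campaign> windowedCampaigns = matchedCampaigns
        .mapToPair(campaign -> new Tuple2<String, Campaign>(
                campaign.getCustomer() + "_" + campaign.getId(), campaign))
        .reduceByKeyAndWindow(
                (campaign1, campaign2) -> campaign1,   // keep a single campaign per key
                Durations.seconds(5),                  // window length = batch interval
                Durations.seconds(5));                 // slide interval = batch interval

windowedCampaigns.foreachRDD(CmpManager::sendEmail);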

  • I think the problem is that you want to remove duplicates in a multiset that is not defined (since you set no boundaries, like a window). What you actually want is a sliding window (time- or tuple-based) in which you can remove the duplicates: `uniqueCampaigns.reduceByKeyAndWindow((c1,c2)=>c1, Seconds(10), Seconds(1))`. However, since Spark cannot handle event time, the result of this operation is still not deterministic, and thus you will still get duplicates even if the events in the stream happen within the used window length. – Christian Kuka Nov 12 '16 at 05:39
  • What do you mean by multiset? I even tried uniqueCampaigns.reduceByKeyAndWindow((c1,c2)=>c1, Seconds(5), Seconds(5)), but it didn't work. Is there any way I could remove the duplicates using Spark alone? – Dev Loper Nov 12 '16 at 09:13

0 Answers