I have a Spark Streaming job consuming from Kafka (using createDstream). It is a stream of ids:

[id1,id2,id3 ..]

I have a utility (an API) that accepts an array of ids, makes an external call, and returns some info, say "t", for each id:

[id1:t1,id2:t2,id3:t3 ..]

I want to retain the DStream while calling the utility. I can't use a map transformation on the DStream, because it would make one call per id, whereas the utility accepts a collection of ids.

dstream.map(x => myutility(x)) -- ruled out

And if I use

dstream.foreachRDD(rdd => myutility(rdd.collect.toArray))

I lose the DStream, because foreachRDD is an output operation and returns nothing. I need to keep a DStream for downstream processing.

  • Redesign `myutility` so it can correctly work in parallel? Having single local collection in Spark is no go. – user7337271 Jan 06 '17 at 14:18
  • @user7337271 parallelism is achieved by Dstream.foreachrdd(rdd=> myutility(rdd.collect.toarray)) below, but I lose the DStream – Rushabh Mehta Jan 06 '17 at 14:28
  • There is no parallelism here. The entire body of `foreachrdd(rdd=> myutility(rdd.collect.toarray))` is executed locally on the driver. You could `transform(rdd=> sc.parallelize(myutility(rdd.collect.toarray)))` but it doesn't resolve this problem. – user7337271 Jan 06 '17 at 14:33
  • @user7337271 you are right, I made a wrong assumption – Rushabh Mehta Jan 09 '17 at 15:09

1 Answer

One way to make bulk external calls is to transform the RDDs in the DStream directly, at the partition level.

The pattern looks like this:

val transformedStream = dstream.transform { rdd =>
  rdd.mapPartitions { iterator =>
    // Reserve local resources or open server connections here: once per partition.
    val externalService = Service.instance()
    // Materialize the partition so the service can be called in bulk.
    // Partitioning needs tuning to avoid huge data loads at this level.
    val data = iterator.toList
    val resultCollection = externalService(data)
    resultCollection.iterator
  }
}

This approach processes each partition of the underlying RDD in parallel, using the resources available in the cluster. Note that the connection to the external system needs to be instantiated once per partition (and not once per element).
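If materializing a whole partition with `toList` is too much for a single external call, one possible variation is to split the partition's iterator into fixed-size batches and make one bulk call per batch. The sketch below is plain Scala with no Spark dependency, so it can be tried in isolation. `BulkLookup`, `inBatches`, `batchSize` and `fakeService` are hypothetical names introduced for illustration, and the assumption is that the utility can be viewed as a function from a sequence of ids to a sequence of (id, info) pairs.

```scala
object BulkLookup {
  // Hypothetical helper: calls `bulkCall` once per batch of at most `batchSize`
  // ids, rather than once per id or once for the whole partition.
  // Batches are pulled lazily as the returned iterator is consumed.
  def inBatches[K, V](ids: Iterator[K], batchSize: Int)(
      bulkCall: Seq[K] => Seq[(K, V)]): Iterator[(K, V)] =
    ids.grouped(batchSize).flatMap(batch => bulkCall(batch))
}

object BulkLookupDemo {
  def main(args: Array[String]): Unit = {
    var calls = 0
    // Stand-in for the real external utility (an assumption, not the asker's code).
    val fakeService: Seq[String] => Seq[(String, String)] = { batch =>
      calls += 1
      batch.map(id => id -> s"t-$id")
    }

    val results =
      BulkLookup.inBatches(Iterator("id1", "id2", "id3"), 2)(fakeService).toList

    println(results) // one (id, info) pair per input id
    println(calls)   // number of bulk calls made for the three ids
  }
}
```

Inside the pattern above, this would plug in roughly as `dstream.transform(_.mapPartitions(ids => BulkLookup.inBatches(ids, 500)(myutility)))`, assuming `myutility` has the shape `Seq[String] => Seq[(String, String)]`. The batch size of 500 is arbitrary and would need tuning against the limits of the external service.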

maasg