
For Spark Streaming with Kafka we have DirectStream, a receiver-less approach that maps Kafka partitions to Spark RDD partitions. We currently have an application that uses the Kafka direct approach and maintains its own offsets in an RDBMS.

Is there something similar for Kinesis? When I read the documentation for the Spark-Kinesis integration, checkpointing seems to work differently. Here are my questions:

  1. Does streaming with Kinesis map Kinesis shards to RDD partitions? Can I maintain ordered processing at the shard level if I use foreachPartition on the incoming RDD?
  2. The documentation explains that Kinesis maintains separate checkpoints in DynamoDB. Can't we ignore that and use our own offset management?
  3. In the KinesisUtils.createStream API, I see that the [initial position] parameter accepts only LATEST or TRIM_HORIZON. Does that mean I cannot provide a map of shard to offset, as I do in the Kafka case?

How can we get exactly-once processing if our application is idempotent?

Yuval Itzchakov

1 Answer


Does streaming with Kinesis map Kinesis shards to RDD partitions?

No, there is no 1:1 mapping between Kinesis shards and RDD partitions as stated in the documentation:

There is no correlation between the number of Kinesis stream shards and the number of RDD partitions/shards created across the Spark cluster during input DStream processing. These are 2 independent partitioning schemes


Can I maintain ordered processing at the shard level if I use foreachPartition on the incoming RDD?

Order is maintained within each created partition (not sure that helps):

Kinesis data processing is ordered per partition and occurs at-least once per message.


The documentation explains that Kinesis maintains separate checkpoints in DynamoDB. Can't we ignore that and use our own offset management?

No, you are bound by the Kinesis client implementation, which uses DynamoDB as a backing store.

In the KinesisUtils.createStream API, I see that the [initial position] parameter accepts only LATEST or TRIM_HORIZON. Does that mean I cannot provide a map of shard to offset, as I do in the Kafka case?

No. There is no equivalent of providing offsets the way you can with Kafka.

As you can see, the current implementation of the Kinesis API limits you. If you need the flexibility of storing and restoring offsets and want to achieve exactly-once semantics, consider going with Kafka for this solution as well.
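
On the idempotency part of the question: with at-least-once delivery, an idempotent output step is usually what keeps replayed records from changing the result. Below is a minimal sketch in plain Python (no Spark or Kinesis dependency) of that idea. It assumes each record can be tagged with a stable unique key, here a shard ID plus a sequence number; how that key is obtained from the stream is an assumption, and `IdempotentSink`, `process_batch`, and the record tuples are hypothetical names used only for illustration.

```python
class IdempotentSink:
    """Stores each record at most once, keyed by a stable record ID."""

    def __init__(self):
        # Stand-in for an external table with a unique key constraint.
        self.rows = {}

    def write(self, shard_id, sequence_number, payload):
        # Assumption: (shard_id, sequence_number) uniquely identifies a record.
        key = (shard_id, sequence_number)
        if key in self.rows:
            return False  # replayed record, nothing changes
        self.rows[key] = payload
        return True


def process_batch(sink, records):
    """Write a batch and return how many records were actually new."""
    return sum(sink.write(*record) for record in records)


sink = IdempotentSink()
batch = [("shard-0", "1", "a"), ("shard-0", "2", "b")]

first = process_batch(sink, batch)
# Simulate at-least-once redelivery: the same batch again, plus one new record.
second = process_batch(sink, batch + [("shard-1", "1", "c")])

print(first, second, len(sink.rows))
```

In a real job the dictionary would be replaced by a store that enforces the unique key (for example an upsert into the RDBMS already used for offsets), so a batch replayed after a failure leaves the stored data unchanged.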

Yuval Itzchakov
  • Thanks for the response. In that case Kinesis is way beyond Kafka. This will be a big issue for my cloud migration. I have two critical applications that use exactly-once Spark Streaming. – kalyan chakravarthy Apr 26 '17 at 04:18