For Spark Streaming with Kafka we have the direct stream (DirectStream), which is a receiver-less approach that maps Kafka partitions to Spark RDD partitions. We currently have an application that uses the Kafka direct approach and maintains its own offsets in an RDBMS.
Is there something similar for Kinesis? When I read the documentation for the Spark-Kinesis integration, it seems that checkpointing works differently. Here are the questions I have:
- Does Spark Streaming with Kinesis map Kinesis shards to RDD partitions? Can I maintain ordered processing at the shard level if I use foreachPartition on the incoming RDD?
- The documentation explains that the Kinesis integration maintains separate checkpoints in DynamoDB. Can't we ignore that and use our own offset management?
- In the KinesisUtils.createStream API, I see that the [initial position] parameter accepts only LATEST or TRIM_HORIZON. Does that mean I cannot provide a map of shard to offset, as I do in the Kafka case?
How can we get exactly-once processing if our application is idempotent?
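
To make "our own offset management" concrete, here is a minimal sketch of the pattern I would like to reproduce for Kinesis. It is an illustration only, not our actual code: the table names (`shard_offsets`, `results`) and function names are hypothetical, `sqlite3` stands in for the RDBMS, and it assumes that sequence numbers are numeric strings that increase within a shard and that results and offsets live in the same database so both can be committed in one transaction.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) offset/result store; sqlite3 stands in for the RDBMS."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shard_offsets ("
        "shard_id TEXT PRIMARY KEY, last_seq TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "shard_id TEXT, seq TEXT, payload TEXT, PRIMARY KEY (shard_id, seq))"
    )
    conn.commit()
    return conn


def last_committed(conn, shard_id):
    """Return the last committed sequence number for a shard, or -1 if none."""
    row = conn.execute(
        "SELECT last_seq FROM shard_offsets WHERE shard_id = ?", (shard_id,)
    ).fetchone()
    return int(row[0]) if row else -1


def process_batch(conn, shard_id, records):
    """Store records newer than the committed offset; return how many were stored.

    records: list of (sequence_number_string, payload) in shard order.
    Sequence numbers are kept as text because they can exceed 64 bits
    (assumption: they are decimal strings that increase within a shard).
    """
    committed = last_committed(conn, shard_id)
    fresh = [(seq, payload) for seq, payload in records if int(seq) > committed]
    if not fresh:
        return 0
    # One transaction: results and the new offset commit together or not at all.
    with conn:
        conn.executemany(
            "INSERT INTO results VALUES (?, ?, ?)",
            [(shard_id, seq, payload) for seq, payload in fresh],
        )
        newest = max(fresh, key=lambda r: int(r[0]))[0]
        conn.execute(
            "INSERT OR REPLACE INTO shard_offsets VALUES (?, ?)",
            (shard_id, newest),
        )
    return len(fresh)
```

With this, replaying a batch that overlaps the previous one (for example after a restart) only stores the records past the committed sequence number. What I cannot see is how to tell the Kinesis stream to resume from those stored per-shard positions.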