You will need to manage unique keys manually, but given that approach it is possible when using KafkaUtils.createDirectStream.
From the Spark docs http://spark.apache.org/docs/latest/streaming-kafka-integration.html :
Approach 2: Direct Approach (No Receivers)
... each record is received by Spark Streaming effectively exactly once despite failures.
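For context, a minimal direct-stream setup could look something like the following. This is only a sketch of the Spark 1.x / Kafka 0.8 integration that the linked docs describe; ssc (an existing StreamingContext), the broker list and the topic name are placeholder assumptions.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// placeholders: ssc is an existing StreamingContext, brokers and topic are made up
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("my_topic")

// yields a DStream of (key, message) pairs, consumed directly from Kafka without a receiver
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)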
And here is the idempotency requirement - so, for example, saving a unique key per message in Postgres:
In order to achieve
exactly-once semantics for output of your results, your output
operation that saves the data to an external data store must be either
idempotent, or an atomic transaction that saves results and offsets
(see Semantics of output operations in the main programming guide for
further information).
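The other route the docs mention - an atomic transaction that stores results and offsets together - could look roughly like this. This is a sketch only: it assumes scalikejdbc and the SetupJdbc connection helper used in the blog post linked below, plus a hypothetical offsets table keyed by topic and partition. The rest of this answer sticks to the idempotent route.

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // capture the Kafka offset ranges on the driver before distributing the work
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    // each RDD partition corresponds to exactly one Kafka offset range
    val osr = offsetRanges(TaskContext.get.partitionId)
    DB.localTx { implicit session =>
      iter.foreach { case (_, msg) =>
        sql"insert into results(msg) values (${msg})".update.apply()
      }
      // committing the new offset in the same transaction makes replay safe;
      // "offsets" is a hypothetical table keyed by (topic, part)
      sql"""update offsets set until_offset = ${osr.untilOffset}
            where topic = ${osr.topic} and part = ${osr.partition}""".update.apply()
    }
  }
}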
Here is an idea of the kind of code you would need to manage the unique keys (from http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/ ):
stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // make sure connection pool is set up on the executor before writing
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    iter.foreach { case (key, msg) =>
      DB.autoCommit { implicit session =>
        // the unique key for idempotency is just the text of the message itself, for example purposes
        sql"insert into idem_data(msg) values (${msg})".update.apply
      }
    }
  }
}
In a real application, a unique per-message ID would need to be generated and managed rather than relying on the message text as above.
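For example, the inner write could become something like the sketch below. The table name, column names and the use of ON CONFLICT are assumptions, not from the blog post; msgId is assumed to be a unique ID carried in the Kafka message key (or derived from topic/partition/offset), and ON CONFLICT DO NOTHING needs Postgres 9.5+ - on older versions you would rely on the primary key constraint and ignore duplicate-key errors.

// assumes a table created with a primary key on msg_id, e.g.
//   create table idem_data(msg_id varchar primary key, msg text)
iter.foreach { case (msgId, msg) =>
  DB.autoCommit { implicit session =>
    // ON CONFLICT (Postgres 9.5+) turns a redelivered message into a no-op
    sql"""insert into idem_data(msg_id, msg) values (${msgId}, ${msg})
          on conflict (msg_id) do nothing""".update.apply()
  }
}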