What is the most efficient way of creating a unique list of incoming documents through Kafka when compared with those in ElasticSearch?

Question

In ElasticSearch, I will have an index type for RSS documents, each with its own hash.

Next, I have a scheduler that retrieves a list of RSS documents from a feed through Kafka Connect, with Kafka acting as the broker between microservices.

Using BulkRequestBuilder or BulkProcessor (I have read that the latter is preferable for performance reasons), which of these options is best:

1. Add all incoming RSS documents to a list, each with a hash based on the title; then iterate through the list and remove any documents whose hash matches one already in ES.
2. Before adding a document to the list, check whether its hash already exists in ES, and add it only if it does not.

There may be a better way as well, which I welcome.

Documents will be removed from Kafka once they have been consumed, so would Kafka Streams come into play here? And rather than doing the comparison through a query of some sort, would we use exactly-once semantics in the Kafka producer code, or does that go in the consumer code?

If I'm on the right track with this, can someone please elaborate?

score 0 · Accepted Answer · answered Jan 20 '17 at 22:58

With the Bulk option, an existing document with the same id is completely replaced by the incoming one. If that is acceptable for your use case, you don't have to do anything extra.
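To make the replace-by-id behaviour concrete, here is a minimal sketch that derives a stable id from the title hash, so re-sending the same RSS item lands on the same id. It does not call ElasticSearch: the `Map` is only a stand-in for the index, and `TitleHashIds`, `idFor`, `bulkIndex`, and the trim/lower-case normalisation are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TitleHashIds {
    // Derive a stable document id from the title: SHA-256 of the
    // trimmed, lower-cased title, hex-encoded (normalisation is an assumption).
    static String idFor(String title) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            String normalised = title.trim().toLowerCase(Locale.ROOT);
            byte[] hash = digest.digest(normalised.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    // Stand-in for a bulk index call: writing under an existing id
    // replaces the previous document, so duplicates collapse into one entry.
    static Map<String, String> bulkIndex(Map<String, String> index, List<String> titles) {
        for (String title : titles) {
            index.put(idFor(title), title);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();
        bulkIndex(index, List.of("Kafka 0.10 released", "Elasticsearch tips"));
        // The first title is sent again; it overwrites rather than duplicates.
        bulkIndex(index, List.of("Kafka 0.10 released", "A brand new post"));
        System.out.println(index.size() + " unique documents"); // prints "3 unique documents"
    }
}
```

With the real client, the same idea would mean passing the hash as the document id on each index request in the bulk. If overwriting is not acceptable, ElasticSearch's `create` operation type rejects a write to an id that already exists instead of replacing it.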

Kafka can deliver each message exactly once most of the time, but not all of the time, and only if your producers are not producing duplicates themselves. The exception is a few messages that may be delivered more than once during rebalance events on the Kafka cluster, so consumers should have a way to handle duplicates.
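One way a consumer can tolerate the occasional redelivery is to make its handling idempotent: remember which hashes it has already processed and skip repeats. The sketch below is a plain-Java illustration, not Kafka client code. `IdempotentHandler` and its methods are hypothetical names, and the in-memory `Set` is an assumption; a real consumer would need a durable store, or could simply rely on the hash-as-id overwrite in ElasticSearch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class IdempotentHandler {
    // Hashes of messages that have already been processed (in-memory only).
    private final Set<String> seenHashes = new HashSet<>();
    // Payloads that were accepted, in arrival order.
    private final List<String> accepted = new ArrayList<>();

    // Returns true if the message was processed, false if it was a redelivery.
    boolean handle(String hash, String payload) {
        if (!seenHashes.add(hash)) {
            return false; // hash seen before: skip the duplicate
        }
        accepted.add(payload);
        return true;
    }

    List<String> accepted() {
        return accepted;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        handler.handle("h1", "Post A");
        handler.handle("h2", "Post B");
        handler.handle("h1", "Post A"); // simulated redelivery after a rebalance
        System.out.println(handler.accepted()); // prints "[Post A, Post B]"
    }
}
```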

Kafka, on the other hand, differs from conventional (JMS-based) brokers: a message is not deleted from Kafka on consumption. Deletion is driven by the retention period, set per topic or globally. The benefit is that you can always go back in time to re-consume old messages, or build new use cases that need them.
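The difference from a queue can be pictured with a toy append-only log: reading does not remove anything, and each reader just chooses the offset it starts from. This is only a conceptual sketch in plain Java. `RetainedLog` is a hypothetical class, not Kafka's API, and it ignores partitions and retention-based cleanup entirely.

```java
import java.util.ArrayList;
import java.util.List;

class RetainedLog {
    // Records are only ever appended; consuming never deletes them.
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was stored at.
    int append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns a copy of every record from the given offset onwards.
    List<String> readFrom(int offset) {
        return new ArrayList<>(records.subList(offset, records.size()));
    }

    public static void main(String[] args) {
        RetainedLog log = new RetainedLog();
        log.append("a");
        log.append("b");
        log.append("c");
        System.out.println(log.readFrom(0)); // prints "[a, b, c]"
        System.out.println(log.readFrom(0)); // same records again: "[a, b, c]"
        System.out.println(log.readFrom(1)); // a reader starting later: "[b, c]"
    }
}
```

In Kafka itself, the retention window is controlled by settings such as the topic-level `retention.ms`, and a consumer can rewind with `KafkaConsumer.seek` or `seekToBeginning` to re-read messages that are still retained.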

