In Elasticsearch, I will have an index type of RSS documents, each with its own hash.
Next, I have a scheduler that retrieves a list of RSS documents from a feed through Kafka Connect, with Kafka acting as the broker between microservices.
Using either BulkRequestBuilder or BulkProcessor (I have read that the latter is preferable for performance reasons), which of the following options is best?
- Add all incoming RSS documents to a list, each with a hash based on its title; then iterate through the list and remove any documents whose hash matches one already in ES.
- Before adding a document to the list, check whether its hash already exists in ES, and add it to the list only if it does not.
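To make the two options concrete, here is a rough sketch of the filtering step as I picture it. It uses only the JDK: the set `existingHashes` stands in for whatever hashes would be fetched from ES, and the class name `RssBatchDedup`, the method names, and the choice of SHA-256 are my own assumptions, not anything prescribed by Elasticsearch or Kafka.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RssBatchDedup {

    /** Hex-encoded SHA-256 of the title (the hash function is an assumption). */
    public static String titleHash(String title) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(title.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Keeps only titles whose hash is neither in existingHashes (the
     * stand-in for hashes already in ES) nor seen earlier in this batch.
     */
    public static List<String> newTitles(List<String> incomingTitles,
                                         Set<String> existingHashes) {
        Set<String> seen = new HashSet<>(existingHashes);
        List<String> result = new ArrayList<>();
        for (String title : incomingTitles) {
            // Set.add returns false when the hash was already present.
            if (seen.add(titleHash(title))) {
                result.add(title);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of(titleHash("Old story"));
        List<String> fresh = newTitles(
                List.of("Old story", "New story", "New story"), existing);
        System.out.println(fresh); // prints [New story]
    }
}
```

In this sketch the two options differ only in when `existingHashes` is populated: once per batch (first option) or by a lookup per document (second option). The sketch also drops duplicates within the same batch, which neither option mentions explicitly.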
There may also be a better way, which I would welcome.
Documents will be removed from Kafka once they have been consumed, so would Kafka Streams come into play in this case? And rather than doing the comparison through a query of some sort, would we instead rely on Kafka's exactly-once semantics? If so, does that belong in the producer code or in the consumer code?
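My tentative understanding, which may be wrong, is that Kafka's exactly-once guarantees cover writes back into Kafka, so a consumer writing to an external store such as ES would still need its writes to be idempotent. A minimal sketch of that idea, using only the JDK: the in-memory map is a hypothetical stand-in for the ES index keyed by the document hash, and `createIfAbsent` imitates a create-only write that is skipped when the ID already exists. The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdempotentSinkSketch {

    /** Hypothetical in-memory stand-in for the ES index, keyed by document ID. */
    private final Map<String, String> index = new LinkedHashMap<>();

    /** Create-only write: returns true if stored, false if the ID already existed. */
    public boolean createIfAbsent(String id, String document) {
        return index.putIfAbsent(id, document) == null;
    }

    /** Number of documents currently stored. */
    public int size() {
        return index.size();
    }

    /** Processes a batch of (id, document) records; returns how many were newly stored. */
    public int consume(List<Map.Entry<String, String>> records) {
        int created = 0;
        for (Map.Entry<String, String> record : records) {
            if (createIfAbsent(record.getKey(), record.getValue())) {
                created++;
            }
        }
        return created;
    }

    public static void main(String[] args) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        List<Map.Entry<String, String>> batch = List.of(
                Map.entry("hash-1", "First article"),
                Map.entry("hash-2", "Second article"));

        System.out.println(sink.consume(batch)); // first delivery: prints 2
        System.out.println(sink.consume(batch)); // redelivery of the same batch: prints 0
        System.out.println(sink.size());         // prints 2
    }
}
```

If this is the right idea, the consumer would not need to query ES for existing hashes at all; a redelivered record with the same hash-derived ID would simply be skipped.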
If I'm on the right track with this, can someone please elaborate?