
We are building a Java application that will use embedded Neo4j for graph traversal. These are the reasons we want to use the embedded version instead of a centralized server:

  1. This app is not the data owner. Data will be ingested into it from another app. Keeping the data locally lets us do calculations quickly, which improves our API SLA.
  2. Since the data footprint is small, we don't want to maintain a centralized server, which would incur additional cost and maintenance.
  3. No additional cache is needed.

This architecture brings two challenges. First, how do we update the data in all instances of the embedded Neo4j application at the same time? Second, how do we make sure that all instances are in sync, i.e. using the same version of the data?

We thought of using Kafka to solve the first problem. The idea is to have a Kafka listener with a different group ID in each instance (to ensure every instance gets every update). Whenever there is an update, an event will be posted to Kafka. All instances will listen for the event and perform the update operation.
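To make the group ID idea concrete, here is a minimal sketch of how each instance could build its own consumer configuration. The class name `BroadcastConsumerConfig`, the `graph-updates-` prefix, and the instance ID scheme are hypothetical; the property keys are standard Kafka consumer settings, and the resulting `Properties` would be handed to the consumer.

```java
import java.util.Properties;
import java.util.UUID;

// Hypothetical helper: builds consumer settings so that every application
// instance joins its own consumer group and therefore sees every event.
final class BroadcastConsumerConfig {

    static Properties forInstance(String bootstrapServers, String instanceId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // A distinct group.id per instance means partitions are not shared
        // between instances, so each one receives the full stream.
        props.setProperty("group.id", "graph-updates-" + instanceId);
        // A group with no committed offsets starts from the oldest retained event.
        props.setProperty("auto.offset.reset", "earliest");
        // Assumption: offsets are committed manually after the Neo4j update succeeds.
        props.setProperty("enable.auto.commit", "false");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    // Assumption: a random ID is acceptable when the embedded store is rebuilt on startup.
    static String randomInstanceId() {
        return UUID.randomUUID().toString();
    }
}
```

If the embedded database is persisted across restarts, a stable instance ID (for example the host name) is probably the better choice, so that the instance resumes from its committed offset instead of starting over.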

However, we still don't have a solid design for the second problem. For various reasons, an instance can miss an event (for example, its consumer is down). One option is to keep checking the latest version by calling an API of the data owner app and, if the local version is behind, replay the events. But this brings the additional complexity of maintaining an event log of all updates. Can this be done in a better and simpler way?
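The version check we have in mind could look roughly like the sketch below. It assumes the data owner stamps every event with a version number that increases by one per update; that numbering scheme and the `VersionTracker` class are hypothetical and not something we have built yet.

```java
// Hypothetical sketch: tracks the last data version applied to the local
// embedded Neo4j store and detects missed events.
final class VersionTracker {

    enum Outcome { APPLIED, DUPLICATE, GAP }

    private long lastApplied;

    VersionTracker(long initialVersion) {
        this.lastApplied = initialVersion;
    }

    // Called for each incoming event. Only the next expected version is
    // accepted; anything else is reported so the caller can react.
    synchronized Outcome offer(long eventVersion) {
        if (eventVersion <= lastApplied) {
            return Outcome.DUPLICATE;   // already applied, safe to ignore
        }
        if (eventVersion == lastApplied + 1) {
            lastApplied = eventVersion; // caller applies the update to Neo4j
            return Outcome.APPLIED;
        }
        return Outcome.GAP;             // at least one event was missed
    }

    synchronized long lastApplied() {
        return lastApplied;
    }

    // Compares against the version reported by the data owner's API.
    synchronized boolean isBehind(long ownerVersion) {
        return lastApplied < ownerVersion;
    }
}
```

On `GAP`, or when `isBehind` returns true during a periodic check, the instance would have to replay the missing events, which is exactly the part that requires an event log.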

Rishi Saraf

1 Answer


Kafka consumers are extremely consistent and reliable once you have them configured properly, so there shouldn't be any reason for them to miss messages unless there's an infrastructure problem, in which case any solution you architect will have problems. If the Kafka cluster is healthy (e.g. at least one copy of the data is available and a quorum of ZooKeeper nodes is up and running), then your consumers should receive every single message from the topics they're subscribed to. The consumer handles retries and reconnecting itself, as long as your timeout/retry configurations are sane. The default configs in the latest Kafka versions are adequate 99% of the time.

In addition, you can add a separate thread, for example, that constantly checks the latest offset per topic/partition, compares it to what the consumer has last received, and issues an alert or warning if there is a discrepancy. In my experience, and with Kafka's reliability, it should be unnecessary, but it can give you peace of mind and shouldn't be too difficult to add.
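The comparison itself is simple. Below is a minimal sketch that uses plain maps keyed by a partition name such as `"updates-0"`; the `LagChecker` class and the key format are made up for illustration. In a real consumer the two maps would be filled from `KafkaConsumer.endOffsets(...)` and `KafkaConsumer.position(...)`. Keep in mind that `KafkaConsumer` is not safe for multi-threaded access, so the monitoring thread should read values collected by the polling thread or use its own client.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: reports how far the consumer is behind on each partition.
final class LagChecker {

    // endOffsets: offset of the next message to be written, per partition.
    // positions:  offset of the next message the consumer will read, per partition.
    // Returns only the partitions where the consumer is behind.
    static Map<String, Long> lagPerPartition(Map<String, Long> endOffsets,
                                             Map<String, Long> positions) {
        Map<String, Long> lag = new HashMap<>();
        for (Map.Entry<String, Long> entry : endOffsets.entrySet()) {
            // Assumption: a partition with no recorded position has not been read at all.
            long position = positions.getOrDefault(entry.getKey(), 0L);
            long behind = Math.max(0L, entry.getValue() - position);
            if (behind > 0) {
                lag.put(entry.getKey(), behind);
            }
        }
        return lag;
    }
}
```

A scheduled task could call `lagPerPartition` every few seconds and raise an alert if the same partition stays behind for several checks in a row, since a small lag right after a burst of updates is normal.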

mjuarez