I have a system (Cassandra) containing data that I would like to one-way "sync" with my TinkerPop-enabled graph store (I use AWS Neptune). By one-way sync I mean that the data is only ever updated by the sync process, from the source of truth to the graph store, and is read-only for users of the graph store.
The source of truth holds a relatively small data set; when built as a graph, it comprises fewer than 1 million vertices and edges.
I'm looking at the following two approaches:
A) Using Neptune Bulk Loader:
- Whenever the source of truth changes, dump all the data as a snapshot into a file (possibly use change events for deltas in the future)
- Read all relevant data from the graph store and compute the vertices and edges to upsert
- Write all vertices and edges to CSV files and load them into Neptune
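To make the compute-and-write steps concrete, here is a minimal sketch of what I have in mind for vertices. The `Vertex` record, its single `name` property, and the class and method names are made up for illustration. The header row follows Neptune's Gremlin CSV load format (`~id`, `~label`, then typed property columns). Edges would need a second file with `~id,~from,~to,~label`. How the loader treats properties that already exist depends on cardinality settings, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: diffs a source snapshot against the graph's current
// vertices and renders the changed ones as Neptune Gremlin-format CSV.
class VertexCsvSketch {

    // Assumed minimal schema: id, label and one "name" property.
    record Vertex(String id, String label, String name) {}

    // Returns the source vertices that are missing from, or differ from, the graph.
    static List<Vertex> toUpsert(Map<String, Vertex> source, Map<String, Vertex> graph) {
        List<Vertex> changed = new ArrayList<>();
        for (Vertex v : source.values()) {
            if (!v.equals(graph.get(v.id()))) {
                changed.add(v);
            }
        }
        return changed;
    }

    // Quotes a CSV field and doubles embedded quotes (RFC 4180 style).
    static String csv(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Renders a header row plus one row per vertex.
    static String toCsv(List<Vertex> vertices) {
        StringBuilder sb = new StringBuilder("~id,~label,name:String\n");
        for (Vertex v : vertices) {
            sb.append(csv(v.id())).append(',')
              .append(csv(v.label())).append(',')
              .append(csv(v.name())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Vertex> source = new LinkedHashMap<>();
        source.put("u1", new Vertex("u1", "user", "Ann"));
        source.put("u2", new Vertex("u2", "user", "Bob"));
        Map<String, Vertex> graph = new LinkedHashMap<>();
        graph.put("u1", new Vertex("u1", "user", "Ann"));
        System.out.print(toCsv(toUpsert(source, graph)));
    }
}
```

The resulting file would then be uploaded to S3 and handed to the loader. That part is left out here.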
Pros:
- fastest way to load data into Neptune
Cons:
- unsafe: if bulk load fails half-way, the graph store is left in an inconsistent state
B) Using a session with the TinkerPop SessionedClient:
- Whenever the source of truth changes, dump all the data as a snapshot into a file (possibly use change events for deltas in the future)
- Read all relevant data from the graph store and compute the vertices and edges to upsert
- Send batches of Gremlin queries to upsert and delete vertices and edges using a single session
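Roughly, the script building for the last step looks like the sketch below. It uses the common `fold().coalesce(unfold(), addV(...))` upsert pattern and joins several statements into one script per batch. The `Vertex` record, the `name` property, and the helper names are hypothetical, and the escaping only covers single-quoted string literals. Each script would then be submitted in order on one sessioned client (for example the `Client` returned by `Cluster.connect(sessionId)` in gremlin-driver), which is omitted so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that turns vertices into batched Gremlin upsert scripts.
class UpsertScriptSketch {

    // Assumed minimal schema: id, label and one "name" property.
    record Vertex(String id, String label, String name) {}

    // Escapes backslashes and single quotes, then wraps the value in single quotes.
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    // One upsert statement: reuse the vertex if it exists, otherwise create it,
    // then set the property with single cardinality.
    static String upsert(Vertex v) {
        return "g.V(" + quote(v.id()) + ").fold()"
            + ".coalesce(unfold(), addV(" + quote(v.label()) + ")"
            + ".property(id, " + quote(v.id()) + "))"
            + ".property(single, 'name', " + quote(v.name()) + ").iterate()";
    }

    // Groups statements into scripts of at most batchSize statements each.
    static List<String> batches(List<Vertex> vertices, int batchSize) {
        List<String> scripts = new ArrayList<>();
        for (int i = 0; i < vertices.size(); i += batchSize) {
            List<String> statements = new ArrayList<>();
            for (Vertex v : vertices.subList(i, Math.min(i + batchSize, vertices.size()))) {
                statements.add(upsert(v));
            }
            scripts.add(String.join("; ", statements));
        }
        return scripts;
    }

    public static void main(String[] args) {
        List<Vertex> vertices = List.of(
            new Vertex("u1", "user", "Ann"),
            new Vertex("u2", "user", "O'Brien"),
            new Vertex("u3", "user", "Cy"));
        for (String script : batches(vertices, 2)) {
            System.out.println(script);
        }
    }
}
```

The batch size is a tuning knob I have not measured. Larger batches mean fewer round trips but longer individual requests.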
Pros:
- safe: since the same session is used throughout the sync, if one Gremlin query fails, everything is rolled back
Cons:
- script-only: SessionedClient only allows Gremlin scripts, so instead of being able to use bytecode, I have to concatenate strings to make Gremlin scripts. Not ideal, but it seems to work.
- slower than bulk loader
- 10-minute limit: Neptune keeps a session open for at most 10 minutes, which caps the whole sync at 10 minutes. Given the size of the data, I don't expect loading to take longer than that.
I tried both options with small data sets. I also tried the regular one-transaction-per-request Java client, but sending all the changes in a single request does not feel future-proof. Am I correct?
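To guard against the time limit, I'm considering a simple budget check before each batch, along the lines of the sketch below. The class, the method, and the injected clock and `submit` callback are all hypothetical. In the real sync, `submit` would send the script on the session, and the caller would abandon the session when the exception is thrown.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

// Hypothetical guard that refuses to start a batch once the time budget is spent.
class SessionBudgetSketch {

    // Submits batches in order and returns how many were sent.
    // Throws IllegalStateException if the budget runs out before a batch starts.
    static int runWithinBudget(List<String> batches, Duration budget,
                               LongSupplier nanoClock, Consumer<String> submit) {
        long deadline = nanoClock.getAsLong() + budget.toNanos();
        int done = 0;
        for (String batch : batches) {
            if (nanoClock.getAsLong() - deadline >= 0) {
                throw new IllegalStateException(
                    "Time budget exhausted after " + done + " batches");
            }
            submit.accept(batch);
            done++;
        }
        return done;
    }

    public static void main(String[] args) {
        // Budget kept below the 10-minute session limit to leave some headroom.
        int sent = runWithinBudget(
            List.of("batch-1", "batch-2"),
            Duration.ofMinutes(9),
            System::nanoTime,
            script -> System.out.println("submitting " + script));
        System.out.println(sent + " batches sent");
    }
}
```

The check only runs between batches, so a single slow batch could still overrun the limit.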
I'm about to embark on productizing approach B and I would like to know if there are any pitfalls I should look out for or other options I haven't considered?