
I have a system (Cassandra) that contains data that I would like to one-way "sync" with my TinkerPop-enabled store (I use AWS Neptune). By one-way sync I mean the data is only ever updated by the sync process, from the source of truth to the graph store, and is read-only for users of the graph store.

The source of truth holds a relatively small data set; when built as a graph, it comprises fewer than 1 million vertices and edges.

I'm looking at the following two approaches:

A) Using Neptune Bulk Loader:

  1. Whenever the source of truth changes, dump all the data as a snapshot into a file (possibly use change events for deltas in the future)
  2. Read all relevant data from the graph store and compute the vertices and edges to upsert
  3. Write all vertices and edges to CSV files and load them into Neptune

Pros:

  • fastest way to load data into Neptune

Cons:

  • unsafe: if a bulk load fails halfway, the graph store is left in an inconsistent state
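For reference, step 3 of approach A amounts to emitting files in Neptune's Gremlin bulk-load CSV format, where vertex files need `~id` and `~label` columns and edge files need `~id`, `~from`, `~to`, and `~label`. A minimal sketch in Java; the `person`/`knows` labels and the `name` property are illustrative assumptions, not from my actual data model:

```java
import java.util.List;

// Sketch: render vertex and edge files in Neptune's bulk-load CSV format.
// The "person"/"knows" labels and the name property are illustrative.
public class NeptuneCsv {

    // Vertex files require ~id and ~label columns; here one String property.
    static String vertexFile(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("~id,~label,name:String\n");
        for (List<String> r : rows) {
            sb.append(String.join(",", r)).append('\n');
        }
        return sb.toString();
    }

    // Edge files require ~id, ~from, ~to and ~label columns.
    static String edgeFile(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("~id,~from,~to,~label\n");
        for (List<String> r : rows) {
            sb.append(String.join(",", r)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(vertexFile(List.of(List.of("v1", "person", "alice"))));
        System.out.print(edgeFile(List.of(List.of("e1", "v1", "v2", "knows"))));
    }
}
```

The resulting files would then be staged to S3 and handed to the Neptune loader endpoint.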

B) Using a Session with the TinkerPop SessionedClient:

  1. Whenever the source of truth changes, dump all the data as a snapshot into a file (possibly use change events for deltas in the future)
  2. Read all relevant data from the graph store and compute the vertices and edges to upsert
  3. Send batches of Gremlin queries to upsert and delete vertices and edges within a single session

Pros:

  • safe: since the same session is used throughout the sync, if one Gremlin query fails, everything is rolled back

Cons:

  • script-only: SessionedClient only accepts Gremlin scripts, so instead of using bytecode, I have to concatenate strings to build the scripts. Not ideal, but it seems to work.
  • slower than bulk loader
  • 10-minute limit: Neptune allows a session to stay open for at most 10 minutes, which caps the sync at 10 minutes. I don't think the load will exceed that given the size of the data.
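To make the script-only con concrete, step 3 of approach B can be sketched as building one batched upsert script with parameter bindings, using the common fold()/coalesce() upsert idiom. The `person` label and `pid` property here are illustrative, not from my actual data:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: build one batched upsert script plus its bindings. The script
// and bindings would then be submitted on a sessioned client, e.g.:
//   Client client = cluster.connect(sessionId);  // sessioned client
//   client.submit(script, bindings);
public class UpsertScriptBuilder {

    // One upsert traversal per id; values go into bindings (b0, b1, ...)
    // rather than being concatenated directly into the script text.
    static Map.Entry<String, Map<String, Object>> build(List<String> ids) {
        StringBuilder script = new StringBuilder();
        Map<String, Object> bindings = new LinkedHashMap<>();
        for (int i = 0; i < ids.size(); i++) {
            String b = "b" + i;
            bindings.put(b, ids.get(i));
            script.append("g.V().has('person','pid',").append(b)
                  .append(").fold().coalesce(unfold(),addV('person')")
                  .append(".property('pid',").append(b).append(")).iterate();");
        }
        return Map.entry(script.toString(), bindings);
    }
}
```

Using bindings rather than inlining values keeps the concatenation manageable and avoids quoting/injection issues in the generated scripts.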

I tried both options with small data sets. I also tried the regular one-transaction-per-request Java client, but sending all the changes in a single request does not feel future-proof. Am I correct?

I'm about to embark on productizing approach B, and I would like to know whether there are any pitfalls I should look out for, or other options I haven't considered.

Tinou

1 Answer


A few thoughts - you have already done a good job thinking through some of the pros and cons.

  1. With Neptune, if you are regularly doing a significant number of writes that add data in a non-upsert fashion, the bulk loader is a good choice. However, as you note, the semantics of the bulk loader are "do the best you can": depending on how it is configured, it will either load every valid CSV row (skipping invalid ones) or fail as soon as one row is found to be invalid. If you can guarantee, by screening your data ahead of time, that it is clean, the bulk loader may still be a good option.

  2. Gremlin sessions give you more control over the transaction, but as you noted, the queries currently have to be sent in text form. However, the TinkerPop 3.5.0 release added support for bytecode transactions, initially for the Java client with others to follow soon: Node.js is in the works already, and hopefully Python soon after. Once Amazon Neptune moves up to the TinkerPop 3.5.x level, you will be able to take advantage of the new bytecode transaction syntax/semantics. Note that the 10-minute limit on sessions applies only when the session is idle; any activity resets the timer back to 10 minutes.
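The screening suggestion in point 1 can be as simple as validating every row before the file is handed to the bulk loader. A minimal sketch, assuming a vertex-file header of `~id,~label,...` where the first two columns are required:

```java
// Sketch of pre-load screening: check each vertex CSV row so invalid
// rows never reach Neptune. The assumed layout is "~id,~label,..."
// with the first two columns mandatory and non-empty.
public class RowScreen {

    // A row is acceptable only if ~id and ~label are present and non-empty.
    static boolean validVertexRow(String line) {
        String[] cols = line.split(",", -1);
        return cols.length >= 2 && !cols[0].isBlank() && !cols[1].isBlank();
    }
}
```

Rows that fail the check can be logged and quarantined, so the remaining file loads cleanly in either loader mode.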

Kelvin Lawrence
  • Thank you for the thoughtful answer. I went ahead and confirmed that one can hold a session for longer than 10 minutes as long as it is not idle. Also, for whoever reads this: I was able to use the GroovyTranslator to translate bytecode to a Gremlin script (no need to concatenate strings manually). – Tinou Jul 12 '21 at 17:56