
I am loading data into Neptune using Gremlin, on a DB instance of size db.r5.4xlarge (16 vCPUs). The data is loaded into Neptune via an AWS Glue job with 5 worker threads using PySpark.

I load the data by upserting a deduplicated dataset, batching 50 records together into a single query to Neptune.

Vertices: compute all vertices to be loaded into the graph after deduping (there are no duplicate nodes).

Query used:

g.V().has(T.id, record.id).fold().coalesce(__.unfold(), __.addV(record.source).property(T.id, record.id))
 .V().has(T.id, record.id).fold().coalesce(__.unfold(), __.addV(record.source).property(T.id, record.id))
 ... (repeated for the remaining 48 records in the batch) ...
 .next()
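For reference, a minimal sketch of how such a chained batch query might be assembled as a plain Gremlin string (the record fields `id`/`source` mirror the question, but the helper itself is hypothetical, not from the original job):

```python
# Illustrative sketch only: assembles the chained vertex-upsert traversal
# above as a single Gremlin string for one batch. The record fields
# ('id', 'source') mirror the question; the helper is hypothetical.

def build_vertex_upsert(batch):
    """Chain one upsert step per record into a single query string."""
    parts = ["g"]
    for record in batch:
        parts.append(
            ".V('{vid}').fold()"
            ".coalesce(__.unfold(), __.addV('{label}').property(id, '{vid}'))"
            .format(vid=record["id"], label=record["source"])
        )
    parts.append(".next()")
    return "".join(parts)

batch = [{"id": "v1", "source": "person"}, {"id": "v2", "source": "person"}]
query = build_vertex_upsert(batch)
assert query.count(".coalesce(") == len(batch)  # one upsert step per record
```

In practice the same chaining can be done on gremlinpython traversal objects instead of strings; the string form is just the easiest way to show the shape of the batched query.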

Time taken for 2.45M unique vertices: 5 minutes.

Edges: compute all edges to be loaded into the graph after deduping (there are no duplicate edges).

Query used:

g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold()
   .coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
   .property(Cardinality.single, timestamp, edgeData.timestamp)
   .property(Cardinality.single, count, edgeData.count)
 .V(edgeData.id1).bothE()... (repeated for the remaining 48 edges in the batch) ...
 .next()
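The edge batch can be assembled the same way; a minimal sketch (the `edgeData` field names mirror the question, the helper is hypothetical, and the sketch writes plain edge properties since TinkerPop edge properties are single-valued and `property()` on an edge does not take a cardinality argument):

```python
# Illustrative sketch only: assembles the chained edge-upsert traversal
# (with the two edge properties) as a single Gremlin string. Field names
# mirror `edgeData` in the question; the helper is hypothetical.

def build_edge_upsert(batch):
    """Chain one edge upsert per record into a single query string."""
    parts = ["g"]
    for e in batch:
        parts.append(
            ".V('{a}').bothE().where(__.otherV().hasId('{b}')).fold()"
            ".coalesce(__.unfold(), __.addE('coincided_with')"
            ".from_(__.V('{a}')).to(__.V('{b}')))"
            ".property('timestamp', '{ts}').property('count', {cnt})"
            .format(a=e["id1"], b=e["id2"], ts=e["timestamp"], cnt=e["count"])
        )
    parts.append(".next()")
    return "".join(parts)

q = build_edge_upsert([{"id1": "a", "id2": "b", "timestamp": "t0", "count": 3}])
assert q.count(".addE('coincided_with')") == 1
```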

Time taken for 1.88M unique edges with properties: 21 minutes.

If we perform edge creation alone, without any edge properties:

Query used:

g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold()
   .coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
 .V(edgeData.id1).bothE()... (repeated for the remaining 48 edges in the batch) ...
 .next()

Time taken for 1.88M unique edges without properties: 4 minutes.

Performance Issues:

  1. Ideally we should not see any ConcurrentModificationException while inserting vertices, but we get it frequently, even when creating vertices on a fresh Neptune instance (db.r5.4xlarge). We have mitigated this with retry logic, but there are cases where an edge insert between vertices (A -> B) still fails even after 10 retries at 300 millisecond intervals. Overall we end up spending more time inserting our data. Is there a way to avoid these concurrency exceptions, given that we are already avoiding concurrent writes to the same elements?
  2. When adding edge properties during batch upsertion, the time taken is much longer than upserting edges without properties. For example, adding 2 properties per edge, 1.8M edges with properties took close to 21 minutes to upsert, while 1.8M edges without properties took close to 4 minutes. Edge creation with properties is much slower; is there any way to speed up loading edges with properties? (We have 40M edges, so the total insert time is much longer.)
  3. Adding more parallel worker threads makes us much slower overall and produces more concurrency errors (CPU load is around 50% and not maxing out).
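For issue 1, a retry with exponential backoff and jitter (instead of a fixed 300 ms interval) can spread retries out so colliding batches stop re-colliding. A minimal sketch, where `submit_batch` is a hypothetical stand-in for the code that executes the chained traversal:

```python
# Illustrative sketch only: retrying a batch on a concurrency error with
# exponential backoff plus jitter. `submit_batch` is a hypothetical
# stand-in for the function that runs the chained traversal.
import random
import time

def with_retries(submit_batch, batch, max_attempts=10, base_delay=0.3):
    """Retry with exponential backoff and jitter; re-raise after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return submit_batch(batch)
        except Exception:  # in practice, catch only the CME error from the driver
            if attempt == max_attempts - 1:
                raise
            # Back off 0.3 s, 0.6 s, 1.2 s, ... plus up to 100 ms of jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter matters: if all colliding workers retry on the same fixed interval, they tend to collide again on the next attempt.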

Any suggestions to improve performance would be much appreciated.

Gokulakrishnan
  • What version of Neptune are you using? – Taylor Riggan Jul 14 '21 at 14:03
  • neptune version is 1.0.4.1.R4 – Gokulakrishnan Jul 14 '21 at 15:25
  • Update to 1.0.4.2Rx. There is a mid-traversal V() fix that was applied in that version. Before that, the use of V() in the middle of a query can cause locks to be taken on a much wider portion of the database. – Taylor Riggan Jul 14 '21 at 15:43
  • @TaylorRiggan Yes, I have done the upgrade and performed the run, and got some improvement for sure: it completed in 16 mins, a reduction of 5 mins, but I do still get concurrent exceptions – Gokulakrishnan Jul 15 '21 at 11:55
  • If you're naively writing to the graph and not considering locks associated to adjacent components, then you're going to encounter CMEs. In most cases, an immediate retry should take care of the query that was thrown the CME. The only other means to address this is to partition your writes in a manner which reduces the likelihood of concurrent threads modifying the same components in the graph. – Taylor Riggan Jul 23 '21 at 15:28

1 Answer


With this many vertices and edges it might be worth using the bulk loader instead: create CSV files and import them from S3: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format.html

Tip: put the curl commands for the loader into a SageMaker notebook so you can run them from there.
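The loader request can also be issued from Python instead of curl; a minimal sketch using only the standard library, with the endpoint, S3 path, and IAM role as placeholders you would substitute:

```python
# Illustrative sketch only: submitting a Neptune bulk-load job over the
# loader HTTP API (default port 8182). Endpoint, S3 source, and IAM role
# below are placeholders -- substitute your own values.
import json
import urllib.request

def build_load_request(endpoint, source, iam_role_arn, region):
    """Build the POST request for Neptune's /loader endpoint."""
    payload = {
        "source": source,            # e.g. "s3://my-bucket/vertices/"
        "format": "csv",
        "iamRoleArn": iam_role_arn,  # role attached to the cluster
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "HIGH",       # or "OVERSUBSCRIBE", per the comment below
    }
    return urllib.request.Request(
        "https://{}:8182/loader".format(endpoint),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# req = build_load_request("my-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com",
#                          "s3://my-bucket/vertices/",
#                          "arn:aws:iam::123456789012:role/NeptuneLoadRole",
#                          "us-east-1")
# urllib.request.urlopen(req)  # response carries a loadId you can poll for status
```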

Jeroen Vlek
  • If you use the Neptune notebooks there is also a `%load` magic that presents a form to fill in and submits the load for you. – Kelvin Lawrence Sep 22 '21 at 21:58
  • We're getting slow throughput with the bulk loader currently; it is taking an estimated 50 hours to load. Is there a recommendation for the size of each CSV file being loaded? Is 10-20 GB recommended? Currently ours are 100 MB - 1.5 GB each. – Ryan Sep 25 '21 at 21:17
  • You can increase parallelism as an option (e.g. HIGH or OVERSUBSCRIBE). You can also consider temporarily adding beefy writer nodes (maybe behind a designated writer endpoint) for the duration of the load job and then scaling the cluster down again once it has finished (maybe automated through Terraform or CloudFormation). – Jeroen Vlek Sep 28 '21 at 08:30
  • @kelvin-lawrence Is it me, or does the `%load` magic not provide any output/feedback like a job id? I've submitted 2 jobs this way and I don't see anything happening in the notebook, but the job is carried out. Do you recognize this? – Jeroen Vlek Oct 21 '21 at 12:52