We are loading data into Amazon Neptune using Gremlin. The Neptune DB instance is a db.r5.4xlarge (16 vCPUs). Data is loaded from an AWS Glue job with 5 worker threads using PySpark.
We load the data as upserts on a deduplicated dataset, batching 50 records together into a single query to Neptune.
Vertices: compute all vertices to be loaded into the graph after deduplication (there are no duplicate nodes).
Query used:
g.V().has(T.id, record.id).fold().coalesce(__.unfold(), __.addV(record.source).property(T.id, record.id))
.V().has(T.id, record.id).fold().coalesce(__.unfold(), __.addV(record.source).property(T.id, record.id))
(repeat for the remaining 48 records in the batch)
.next()
Time taken for 2.45M unique vertices: 5 minutes.
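The grouping of 50 records per query described above can be sketched as a small generator. This is only an illustration in plain Python: the name `batched` and its parameters are hypothetical and not part of our Glue job, and it assumes the records arrive as any iterable (for example a partition iterator).

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

Record = TypeVar("Record")


def batched(records: Iterable[Record], batch_size: int = 50) -> Iterator[List[Record]]:
    """Yield consecutive lists holding at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice consumes the next batch_size items without loading everything
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # 120 dummy records split into batches of 50
    for batch in batched(range(120), 50):
        print(len(batch))
```

Each yielded batch would then be turned into one chained traversal like the query above, ending in a single .next() call.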
Edges: compute all edges to be loaded into the graph after deduplication (there are no duplicate edges).
Query used:
g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2))).property(Cardinality.single, 'timestamp', edgeData.timestamp).property(Cardinality.single, 'count', edgeData.count)
.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2))).property(Cardinality.single, 'timestamp', edgeData.timestamp).property(Cardinality.single, 'count', edgeData.count)
(repeat for the remaining 48 records in the batch)
.next()
Time taken for 1.88M unique edges with properties: 21 minutes.
If we create the edges alone, without adding any edge properties:
Query used:
g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(),__.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(),__.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
(repeat for the remaining 48 records in the batch)
.next()
Time taken for 1.88M unique edges without properties: 4 minutes.
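To put the timings above side by side, here is a small back-of-the-envelope calculation. It assumes throughput stays constant as the graph grows, which may well not hold in practice; the function names are just for illustration.

```python
def rate_per_second(items: int, minutes: float) -> float:
    """Average number of items written per second over a measured run."""
    return items / (minutes * 60)


def projected_minutes(total_items: int, sample_items: int, sample_minutes: float) -> float:
    """Linear extrapolation of a measured run to a larger item count."""
    return total_items * sample_minutes / sample_items


if __name__ == "__main__":
    # Measured runs from this post
    print("vertices/s:", round(rate_per_second(2_450_000, 5)))
    print("edges/s with properties:", round(rate_per_second(1_880_000, 21)))
    print("edges/s without properties:", round(rate_per_second(1_880_000, 4)))

    # Extrapolation to the full 40M edges (assumes linear scaling)
    print("40M edges with properties, minutes:",
          round(projected_minutes(40_000_000, 1_880_000, 21)))
    print("40M edges without properties, minutes:",
          round(projected_minutes(40_000_000, 1_880_000, 4)))
```

Under that linear assumption, the full 40M edges would take roughly 447 minutes (about 7.4 hours) with properties, versus roughly 85 minutes without them.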
Performance Issues:
- Ideally we should not see any ConcurrentModificationException while inserting vertices, but we get it frequently, even when creating vertices in a fresh Neptune instance (db.r5.4xlarge). We have mitigated this with retry logic. However, in some cases an edge insert between two vertices (A -> B) still fails even after 10 retries at 300 ms intervals. Overall, the retries increase the time needed to insert our data. Is there a way to avoid these concurrency exceptions, given that we already try to avoid concurrent scenarios?
- Adding edge properties during batch upserts takes much longer than upserting edges without properties. For example, upserting 1.88M edges with 2 properties took close to 21 minutes, while 1.88M edges without properties took close to 4 minutes. Is there any way to speed up loading edges with properties? We have 40M edges, so the total insert time is much longer.
- When we add more parallel worker threads, the load becomes much slower and we see more concurrency errors (CPU load is around 50%, so it is not maxed out).
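For context on the retry logic mentioned in the first point, below is a minimal sketch of a retry wrapper that uses exponential backoff with random jitter instead of a fixed 300 ms interval. It is not our exact code. The names `run_with_backoff` and `is_concurrent_modification` are hypothetical, and it assumes the driver exposes the text "ConcurrentModificationException" in the exception message, which should be checked against the actual error object.

```python
import random
import time
from typing import Callable, TypeVar

Result = TypeVar("Result")


def is_concurrent_modification(error: Exception) -> bool:
    # Assumption: the error code appears in the exception text.
    return "ConcurrentModificationException" in str(error)


def run_with_backoff(
    submit: Callable[[], Result],
    max_attempts: int = 10,
    base_delay: float = 0.05,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Result:
    """Call submit(), retrying only on concurrent-modification errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return submit()
        except Exception as error:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not is_concurrent_modification(error):
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay;
            # the random factor spreads retries from different workers apart.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
    raise RuntimeError("unreachable: loop always returns or raises")
```

Here `submit` would be a zero-argument callable that sends one batch, for example a lambda wrapping the chained traversal and its .next() call. The jitter is meant to keep workers that collided once from retrying at the same instant again.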
Any suggestions to improve performance would be a great help.