I'm having problems inserting data into Neptune using Gremlin. I am trying to insert many nodes and edges, potentially hundreds of thousands, while checking whether each one already exists.

Currently, we are using inject to insert the nodes, and the problem is that it is slow.

After running the explain command, we found that the problem is the coalesce and where steps: they take more than 99.9% of the run duration.

I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.

For example, the query we use to insert nodes with inject:

properties_list = [{'uid': '1642'}, {'uid': '1322'}, ...]
g.inject(properties_list).unfold().as_('node')
  .sideEffect(__.V().where(P.eq('node')).by('uid').fold()
  .coalesce(__.unfold(), __.addV(label).property(Cardinality.single, 'uid', '1')))

With 1000 nodes in the graph and a properties_list of 100 elements, the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
A naive insert in the same environment, without coalesce and where, takes less than 1 second. I'd like to hear your suggestions and learn the best practices for inserting many nodes and edges while checking for existence.

Thank you very much.

ronenpi18

1 Answer

If you have a set of IDs that you want to check for existence, you can speed up the query significantly by passing just the list of IDs to the query and working out up front which of them already exist. Then, having calculated the set that needs updates, you can apply them all in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V() has a lot of work to do: it is evaluated again for every element that inject() feeds into the traversal. In general it would be better to use actual IDs rather than properties (UID in your case). If that is not an option, the same technique will work for property-based IDs. The steps are:

  1. Using inject or sideEffect, pass in the IDs to be found as one list, and the corresponding changes to be conditionally applied as a separate map.
  2. Work out which of those IDs already exist in the graph and which do not.
  3. For the IDs that do not exist, apply the updates, using each ID to look up its entry in the map.

Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:

Given:

ids = "['1','2','9998','9999']"

and

data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"

we can do something like this:

g.V().hasId(${ids}).id().fold().as('exist').
      constant(${data}).
      unfold().as('d').
      where(without('exist')).by('id').by()

which correctly finds the ones that do not already exist:

{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
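
If the IDs and data start out as Python objects rather than hand-typed strings, the two literals can be generated. This is only a sketch: the helper names are hypothetical, it handles string values only, and where your client supports parameter bindings those are generally a safer choice than building query text.

```python
def groovy_string(value):
    """Quote a value as a single-quoted Groovy string, escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"


def groovy_list(values):
    """Render an iterable as a Groovy list literal of strings."""
    return "[" + ",".join(groovy_string(v) for v in values) + "]"


def groovy_map(mapping):
    """Render a dict as a Groovy map literal; an empty map is written [:]."""
    if not mapping:
        return "[:]"
    pairs = (groovy_string(k) + ":" + groovy_string(v) for k, v in mapping.items())
    return "[" + ",".join(pairs) + "]"


def groovy_list_of_maps(rows):
    """Render a list of dicts as a Groovy list of map literals."""
    return "[" + ",".join(groovy_map(row) for row in rows) + "]"


rows = [
    {"id": "1", "value": "XYZ"},
    {"id": "9998", "value": "ABC"},
    {"id": "9999", "value": "DEF"},
]
ids = groovy_list(["1", "2", "9998", "9999"])
data = groovy_list_of_maps(rows)
print(ids)
print(data)
```

The two printed strings have the same shape as the `ids` and `data` values used in this example.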

You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:


g.V().hasId(${ids}).id().fold().as('exist').
      constant(${data}).
      unfold().as('d').
      where(without('exist')).by('id').by().
      addV('test').
        property(id,select('d').select('id')).
        property('value',select('d').select('value'))

v[9998]
v[9999]
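
The question mentions hundreds of thousands of nodes. One option, which is a suggestion rather than part of the pattern above, is to run the check-and-insert query in fixed-size batches so that no single request carries the whole data set. The sketch below only shows the splitting; the `batches` helper, the sample records and the batch size of 3 are all hypothetical, and a real batch size would need tuning.

```python
def batches(rows, size):
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Made-up records in the same shape as `data` in the example above.
rows = [{"id": str(n), "value": "v" + str(n)} for n in range(1, 8)]

for chunk in batches(rows, 3):
    ids = [row["id"] for row in chunk]  # IDs to check for existence in this request
    print(ids)
```

Each chunk then supplies both the ID list and the data for one run of the query.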

As a side note, we are adding two new steps to Gremlin - mergeV and mergeE that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.

Kelvin Lawrence
  • Thank you for your answer! Uploading the nodes worked great as you suggested. I tried to do the same with the edges, but for some reason it’s not working. – ronenpi18 Mar 03 '22 at 18:01
  • My query is the following:
      g.V().hasId(list(nodes_ids)).as_('vertices')
        .constant(not_exist_uids_properties).unfold().as_('edge')
        .select('edge').select('from').as_('from_uid')
        .select('edge').select('to').as_('to_uid')
        .select('edge').select('label').as_('label')
        .select('vertices')
        .where(P.eq('to_uid')).by('uid').as_('_to')
        .select('vertices')
        .where(P.eq('from_uid')).by('uid').as_('_from')
        .addE('label')
        .to('_from')
        .iterate()
    – ronenpi18 Mar 03 '22 at 18:01
  • The query finishes successfully, but no edge is added. Do you have any suggestions for how to add edges properly, given a dict of sources and destinations like the not_exist_uids_properties dict above? Thank you very much! – ronenpi18 Mar 03 '22 at 18:01
  • Without seeing the exact contents of the lists and maps it is hard to say what might be wrong. The pattern I shared in the answer should work equally well for vertices and edges but there may be some minor nuances depending on the exact data. Perhaps you could add all of this information to the question as we are now almost into a new question. – Kelvin Lawrence Mar 03 '22 at 20:19