I am trying to check and insert 1000 vertices in chunks using `Promise.all()`. The code is as follows:

public async createManyByKey(label: string, key: string, properties: object[]): Promise<T[]> {
  const promises = [];
  const allVertices = __.addV(label);
  const propKeys: Array<string> = Object.keys(properties[0]);

  for (const propKey of propKeys) {
    allVertices.property(propKey, __.select(propKey));
  }

  // split the input into chunks of 5 property maps each
  const chunkedProperties = chunk(properties, 5);

  for (const property of chunkedProperties) {
    const singleQuery = this.g.withSideEffect('User', property)
      .inject(property)
      .unfold().as('data')
      .coalesce(
        __.V().hasLabel(label).where(eq('data')).by(key).by(__.select(key)),
        allVertices)
      .iterate();

    promises.push(singleQuery);
  }

  const result = await Promise.all(promises);

  return result;
}

This code throws a `ConcurrentModificationException`. I need help fixing or improving this.

codegutsy

1 Answer

I'm not quite sure about the data and parameters you are using, but I needed to modify your query a bit to get it to work with a data set I have handy (air routes), as shown below. I did this to help me think through what your query is doing. I had to change the second `by` step; I'm not sure how it was working otherwise.

gremlin> g.inject(['AUS','ATL','XXX']).unfold().as('d').
......1>   coalesce(__.V().hasLabel('airport').limit(10).
......2>            where(eq('d')).
......3>              by('code').
......4>              by(), 
......5>            constant('X'))  
==>v['3']
==>v['1']
==>X 

While a query like this runs fine in isolation, once you start running several asynchronous promises that contain mutating steps (as your query does), one promise can try to access a part of the graph that is locked by another. Even though the execution is, I believe, more "concurrent" than truly "parallel", if one promise yields due to an I/O wait and allows another to run, the next one may fail if the prior promise already holds locks in the database that the next promise also needs. In your case, because your coalesce references all vertices with a given label and set of properties, conflicting locks can easily be taken. Perhaps it will work better if you await after each for loop iteration rather than doing it all at the end in one big Promise.all, as sketched below.
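To illustrate, here is a minimal sketch of that sequential variant, reusing the `g`, `__`, `eq`, `label`, `key`, `allVertices` and `chunkedProperties` from your method; the overall shape is mine and untested against your graph:

// Sketch only: await each chunk before submitting the next, so at most
// one mutating traversal holds graph locks at any given time.
const results = [];
for (const property of chunkedProperties) {
  const result = await this.g.withSideEffect('User', property)
    .inject(property)
    .unfold().as('data')
    .coalesce(
      __.V().hasLabel(label).where(eq('data')).by(key).by(__.select(key)),
      allVertices)
    .iterate(); // resolves before the next chunk is submitted
  results.push(result);
}

This trades throughput for safety: each write waits for the previous one to finish, so no two mutating traversals can compete for the same locks.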

Something else to keep in mind is that this query is going to be somewhat expensive regardless, as the mid traversal V is going to happen five times (in the case of your example) for each for loop iteration. This is because the unfold of the injected data is taken from chunks of size 5 and therefore spawns five traversers, each of which starts by looking at V.

EDITED 2021-11-17

As discussed a little in the comments, I suspect the most optimal path is actually to use multiple queries. The first query simply does a g.V(id1,id2,...) for all of the IDs you are potentially going to add and returns the list of IDs found. Remove those from the set to add. Next, break the adding part up into batches and do it without coalesce, as you now know that those elements do not exist. This is most likely the best way to reduce locking and avoid the CMEs (ConcurrentModificationExceptions). Unless someone else may also be trying to add them in parallel, this is the approach I think I would take. A rough sketch of that flow follows.
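Here is a rough TypeScript sketch of that two-query flow, reusing the `g`, `__`, `chunk`, `label`, `key`, `propKeys` and `properties` names from your method. The `P.within` filter (`P` comes from the gremlin JavaScript driver), the batch size of 50, and the intermediate variables are my assumptions, not anything from your code:

// Query 1: a single read to find which key values already exist.
const candidateKeys = properties.map(p => (p as any)[key]);
const existing = await this.g.V()
  .hasLabel(label)
  .has(key, P.within(...candidateKeys))
  .values(key)
  .toList();

// Drop the inputs whose key was found in the graph.
const toAdd = properties.filter(p => !existing.includes((p as any)[key]));

// Query 2, batched: plain addV with no coalesce, since these are known to be absent.
for (const batch of chunk(toAdd, 50)) {
  let t = this.g.inject(batch).unfold().addV(label);
  for (const propKey of propKeys) {
    t = t.property(propKey, __.select(propKey)); // same pattern as your allVertices
  }
  await t.iterate();
}

Because the existence check happens in one read-only query up front, the write batches no longer contain the mid traversal V and coalesce that were taking the wide locks.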

Kelvin Lawrence
  • If `Promise.all()` won't work as expected, then how can I improve the performance? I tried inserting 1000 vertices in a single query and it took more than 2 mins. The performance was something like: 100 vertices - 1.2 secs; 300 vertices - 20 secs; 500 vertices - 45 secs; 900 vertices - 1 min 30 secs. Is there any way to improve this performance? – codegutsy Nov 12 '21 at 08:35
  • Before exploring options, were you able to verify that doing the `await` after each for loop iteration removed the exceptions? – Kelvin Lawrence Nov 12 '21 at 10:49
  • I tried that; it executes in a synchronous/sequential way, so it waits until each loop iteration is completed, and this is even slower when it comes to large numbers. – codegutsy Nov 15 '21 at 05:22
  • Normally the best way to get good write performance is a multi threaded approach where each query adds something like 50 to 100 vertices at a time - not dissimilar to what you are trying. What complicates things in your case is that your query is likely taking locks on large parts of the graph due to the `coalesce` step and in particular the `hasLabel(label)` and so a fully asynchronous/multi threaded approach is going to have issues with exceptions. The mid traversal `V`, as you will see if you profile the query, is also causing a big fanout of vertices visited. – Kelvin Lawrence Nov 15 '21 at 15:38
  • In this specific case it might be best to run 2 queries where the first finds all the vertices not present and the second adds them without needing a `coalesce`. – Kelvin Lawrence Nov 15 '21 at 15:40
  • Okay. I will try this approach by executing 50 to 100 vertices at a time. Thank you, Kelvin. – codegutsy Nov 15 '21 at 17:55
  • This didn't work as expected. It did reduce the execution time, but this query actually fails to check whether the vertex exists or not (if it doesn't exist, it should create it). The following is the gremlin query which worked earlier: g.inject([["userId":"user10", "name":"user 10"],["userId":"user11", "name":"user 11"],["userId":"user12", "name":"user 12"]]) .unfold().as('data') .coalesce(V().hasLabel('User').where(eq('data')).by('name').by(select('name')), addV('User').property('userId',select('userId')).property('name', select('name'))).elementMap() – codegutsy Nov 17 '21 at 12:11
  • As discussed a little above, I suspect the most optimal path is to use multiple queries. The first query does a `g.V()` on all the IDs you are potentially going to add. Have it return a list of IDs found. Remove those from the set to add. Next break the adding part up into batches and do it without `coalesce` as you now know that those elements do not exist. This is most likely the best way to reduce locking and avoid the CMEs (exceptions). Unless someone else may be also trying to add them in parallel, this is the approach I think I would take. – Kelvin Lawrence Nov 17 '21 at 20:23