Deep graph traversals with labelled output

Question

I'm trying to write a function that generates a Gremlin query. The input of the function is an array of arrays of strings, holding the names of the relationships we want to return from the graph. The graph contains information on TV and movies, so an example input would be: [[seasons, episodes, talent], [studios, movies, images]]. The strings refer to edge names.

I need to return a JSON object containing the IDs of the vertices, labelled by their edge names, but I'm finding the Gremlin query very difficult.

So far I've managed to write this query:

g.V('network_1').out().where(__.inE().
    hasLabel('seasons')).
  group().
    by(__.inE().label()).
    by(__.group().by(T.id).
        by(__.out().where(__.inE().
            hasLabel('episodes')).
          group().
            by(__.inE().label()).
            by(__.group().by(T.id).
                by(__.out().where(__.inE().
                    hasLabel('talent')).
                  group().
                    by(__.inE().label()).by(T.id))))).
  next()

Which gives this output:

{
  "seasons": {
    "season_2": {
      "episodes": {
        "episode_4": {
          "talent": [
            "talent_8",
            "talent_6",
            "talent_7"
          ]
        }
      }
    },
    "season_1": {
      "episodes": {
        "episode_2": {
          "talent": [
            "talent_2",
            "talent_3"
          ]
        },
        "episode_3": {
          "talent": [
            "talent_4",
            "talent_5"
          ]
        },
        "episode_1": {
          "talent": [
            "talent_1"
          ]
        }
      }
    }
  }
}

That output is exactly the kind of thing I'm looking for; however, the problems are:

That query seems hugely overcomplicated.
The array of edges to query could be any size. In my example it's 3, but it could be anything.
In the example there are 2 arrays of edges to query, which ideally I could combine into one query.

I'm writing this in Python, and would be hugely appreciative of any help or pointers.

Example content:

g.addV('show').property('id', 'show_1').as('show_1').
  addV('season').property('id', 'season_1').as('season_1').
  addV('season').property('id', 'season_2').as('season_2').
  addV('episode').property('id', 'episode_1').as('episode_1').
  addV('episode').property('id', 'episode_2').as('episode_2').
  addV('episode').property('id', 'episode_3').as('episode_3').
  addV('episode').property('id', 'episode_4').as('episode_4').
  addV('talent').property('id', 'talent_1').as('talent_1').
  addV('talent').property('id', 'talent_2').as('talent_2').
  addV('talent').property('id', 'talent_3').as('talent_3').
  addV('talent').property('id', 'talent_4').as('talent_4').
  addV('talent').property('id', 'talent_5').as('talent_5').
  addV('talent').property('id', 'talent_6').as('talent_6').
  addV('talent').property('id', 'talent_7').as('talent_7').
  addV('talent').property('id', 'talent_8').as('talent_8').
  addE('seasons').from('show_1').to('season_1').
  addE('seasons').from('show_1').to('season_2').
  addE('episodes').from('season_1').to('episode_1').
  addE('episodes').from('season_1').to('episode_2').
  addE('episodes').from('season_1').to('episode_3').
  addE('episodes').from('season_2').to('episode_4').
  addE('talent').from('episode_1').to('talent_1').
  addE('talent').from('episode_2').to('talent_2').
  addE('talent').from('episode_2').to('talent_3').
  addE('talent').from('episode_3').to('talent_4').
  addE('talent').from('episode_3').to('talent_5').
  addE('talent').from('episode_4').to('talent_6').
  addE('talent').from('episode_4').to('talent_7').
  addE('talent').from('episode_4').to('talent_8').iterate()

Could you please provide a Gremlin script that creates some sample data - here is an example https://stackoverflow.com/questions/51388315/gremlin-choose-one-item-at-random — stephen mallette, Jan 19 '21 at 12:29
Hi @stephenmallette, I've added a script to create some data and updated the query and output in my original question to reflect that data. Thanks. — TKems, Jan 19 '21 at 23:19

stephen mallette · Accepted Answer · 2021-01-21T15:46:48.987

For JVM language variants of Gremlin, I think tree() would be quite helpful to you:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   tree().
......4>     by('id').next()
==>show_1={season_2={episode_4={talent_6={}, talent_8={}, talent_7={}}}, season_1={episode_2={talent_3={}, talent_2={}}, episode_3={talent_5={}, talent_4={}}, episode_1={talent_1={}}}}

To the best of my recollection, though, tree() off of the JVM (in your case, Python) isn't well supported. You might try it though.

Another option, one more tuned to Python right now, is to do some nested grouping as you have done in your example. You note it as complex, but I think it is only so because of the backtrack filtering everywhere. I'd also add that while it might appear to work, I sense that it might not quite work in all cases given the use of by(__.inE().label()) to group on, as that only looks at the first edge label for each vertex being grouped. It relies on the structure of the data to be successful, so it might set you up for a bug in the future if suddenly inE() returned something you didn't expect. I suppose you could limit that chance by adding the label, like inE('seasons').label(), but that seems a bit off.

I tend to favor Gremlin that is immediately readable as to its intent. As such, I took the following approach (it doesn't exactly match the output you provided with all the key values, but I think you will find the data to match what you want):

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id').
......5>   group().
......6>     by(limit(local,1)).
......7>     by(tail(local,3).
......8>        group().
......9>          by(limit(local,1)).
.....10>          by(tail(local,2).
.....11>             group().
.....12>               by(limit(local,1)).
.....13>               by(tail(local).fold())))
==>[show_1:[season_2:[episode_4:[talent_6,talent_7,talent_8]],season_1:[episode_2:[talent_2,talent_3],episode_3:[talent_4,talent_5],episode_1:[talent_1]]]]

I like this approach because the navigation part is so simple and direct - out() over "seasons", out() over "episodes" and out() over "talent". There is no question as to what data is being gathered. At line 3 we gather the path and then do a nested group over it to build a similar tree-like structure that I'd generated with tree()-step. In fact this one is a bit nicer in terms of output because it doesn't include empty leaves.

To pick this apart a bit further, start by considering the base output we're working with:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id')
==>[show_1,season_1,episode_1,talent_1]
==>[show_1,season_1,episode_2,talent_2]
==>[show_1,season_1,episode_2,talent_3]
==>[show_1,season_1,episode_3,talent_4]
==>[show_1,season_1,episode_3,talent_5]
==>[show_1,season_2,episode_4,talent_6]
==>[show_1,season_2,episode_4,talent_7]
==>[show_1,season_2,episode_4,talent_8]
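
As a point of comparison, the same layering can also be done on the client once these paths have been fetched. What follows is only a minimal Python sketch, not part of the Gremlin in this answer: nest_paths is a hypothetical helper name, and it assumes every path has the same length and has already been converted to a plain list of id strings.

```python
def nest_paths(paths):
    """Nest equal-length paths into dicts keyed by each leading element.

    Hypothetical helper: the last element of every path ends up in a list,
    mirroring the fold() at the innermost level of a nested group().
    """
    # Base case: only the leaf ids remain, so collect them into a list.
    if all(len(p) == 1 for p in paths):
        return [p[0] for p in paths]
    grouped = {}
    for head, *rest in paths:
        # Group on the first element and keep the remainder of the path.
        grouped.setdefault(head, []).append(rest)
    return {head: nest_paths(tails) for head, tails in grouped.items()}


# The eight paths shown above, written out as plain lists.
PATHS = [
    ["show_1", "season_1", "episode_1", "talent_1"],
    ["show_1", "season_1", "episode_2", "talent_2"],
    ["show_1", "season_1", "episode_2", "talent_3"],
    ["show_1", "season_1", "episode_3", "talent_4"],
    ["show_1", "season_1", "episode_3", "talent_5"],
    ["show_1", "season_2", "episode_4", "talent_6"],
    ["show_1", "season_2", "episode_4", "talent_7"],
    ["show_1", "season_2", "episode_4", "talent_8"],
]

nested = nest_paths(PATHS)
```

Each recursive call handles one layer: it groups on the first element and passes the shortened remainder down, which is the role limit(local,1) and tail(local,n) play in the traversal.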

We want to group on each layer of those paths, which means doing a nested group(). Consider the first layer:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id').
......5>   group().
......6>     by(limit(local,1)).
......7>     by(tail(local,3).fold())
==>[show_1:[[season_1,episode_1,talent_1],[season_1,episode_2,talent_2],[season_1,episode_2,talent_3],[season_1,episode_3,talent_4],[season_1,episode_3,talent_5],[season_2,episode_4,talent_6],[season_2,episode_4,talent_7],[season_2,episode_4,talent_8]]]

The above puts all the "shows" together. Note how we've used tail(local,3) to remove "show_1" from each path object since we've already grouped on it. Next we want to group the "seasons" so:

gremlin> g.V().out('seasons').
......1>   out('episodes').
......2>   out('talent').
......3>   path().
......4>     by('id').
......5>   group().
......6>     by(limit(local,1)).
......7>     by(tail(local,3).
......8>        group().
......9>          by(limit(local,1)).
.....10>          by(tail(local,2).fold()))
==>[show_1:[season_2:[[episode_4,talent_6],[episode_4,talent_7],[episode_4,talent_8]],season_1:[[episode_1,talent_1],[episode_2,talent_2],[episode_2,talent_3],[episode_3,talent_4],[episode_3,talent_5]]]]

Here we know that "seasons" are in the first position so we take the first with limit(local,1) and as we no longer need seasons for further grouping we chop it off the path with tail(local,2). It's "2" this time instead of "3" because the path we are reducing is shortened to just season->episode->talent and now with "2" we go to just episode->talent. Hopefully that breaks down what's happening a bit further and you can adapt this query to your needs.
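
Since the shape of the nested group() depends only on the number of out() steps, the query text can be generated for any depth. Below is a rough Python sketch under a few assumptions: build_group and build_query are hypothetical names, the result is a Gremlin script string meant for a server that accepts scripts, each call handles a single list of edge labels, and the labels are trusted (they are interpolated without any escaping).

```python
def build_group(depth):
    """Return the nested group() text for `depth` out() steps (hypothetical helper)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        # Innermost level: key on the first element, fold the last one into a list.
        return "group().by(limit(local,1)).by(tail(local).fold())"
    # Outer levels: key on the first element, then drop it with tail(local, depth)
    # and hand the shortened path to the next nested group().
    return (
        f"group().by(limit(local,1)).by(tail(local,{depth})."
        f"{build_group(depth - 1)})"
    )


def build_query(edges):
    """Build the full traversal text for one list of edge labels."""
    if not edges:
        raise ValueError("at least one edge label is required")
    # One out() per edge label, in order. Labels are assumed to be trusted input.
    hops = "".join(f".out('{edge}')" for edge in edges)
    return f"g.V(){hops}.path().by('id').{build_group(len(edges))}"


query = build_query(["seasons", "episodes", "talent"])
print(query)
```

For the three labels above this produces the same traversal as the full query shown earlier, just on a single line. With two labels it produces only the last two group() forms.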

Thanks for getting back to me, Stephen. I also like your approach of getting all the data first and then organising it. It's very neat. However, the function which builds the query may take any depth of query. In my example I gave `seasons->episodes->talent`, but the query could be just `seasons->episodes`, or maybe `seasons->episodes->talent->images` etc. What would I need to change in the grouping to accommodate that? — TKems, Jan 21 '21 at 14:51
note that each `group()` is just a level in your tree. There are three `out()` so there must be three embedded `group()`. If you have just two levels like `seasons->episodes` then you would need just the last two `group()` forms. I updated the answer to try to demonstrate further — stephen mallette, Jan 21 '21 at 15:47
Thanks for the update. This has really helped me understand. I’ve been playing around with this query and something I’ve noticed is that if one step of the query has no results, nothing is returned at all. For example, with seasons->episodes->talent, if the episode has no talent relationships the query will return nothing, whereas it would be nice to have the seasons and episodes that were found. — TKems, Jan 22 '21 at 20:46
Have a look at `optional()` https://tinkerpop.apache.org/docs/current/reference/#optional-step - if you wrap that around the `out('talent')` then the traversal can continue to execute without that edge being present. — stephen mallette, Jan 25 '21 at 11:41
Ah, yes that seems to be what I'm looking for! Thanks for your help, I'll mark your response as the answer. You've been a huge help. — TKems, Jan 25 '21 at 15:35
