
I want to filter my graph to only include vertices with fewer than a threshold number of edges (e.g. 50), like so:

g.V().filter(bothE().limit(50).count().is(lt(50)))

This gives me the list of vertices that I want to keep.

How can I create a traversal object which includes only these vertices?

Background

I need to compute the k-hop neighbourhood of every single vertex in a graph while filtering out vertices that have a large number of edges (e.g. 50 or more). The filtered graph has several million edges and vertices.

The first way of doing this that came to mind was to first filter the graph, store the result as a new subgraph, and then iterate over every vertex to find the k-hop neighbourhoods. For a single vertex v, the k=5-hop neighbourhood code is:

g.V(v).repeat(__.bothE().bothV()).times(5).dedup().toList()

A better way might be to iterate over every vertex in the original, unfiltered graph and ignore edges attached to a high-edge-count vertex, but I'm not sure how to do this.
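To make the intended semantics concrete, here is a minimal sketch of that second approach in plain Python rather than Gremlin. It assumes the graph fits in an in-memory adjacency map, which is not the case for my real data, and the names `k_hop_neighbourhood`, `adjacency` and `max_degree` are hypothetical. It walks outwards breadth-first from one vertex and never steps onto a vertex whose edge count, measured in the original graph, reaches the threshold.

```python
from collections import deque


def k_hop_neighbourhood(adjacency, start, k, max_degree):
    """Return the vertices reachable from `start` in at most `k` hops,
    never stepping onto a vertex that has `max_degree` or more edges.

    `adjacency` maps each vertex to a list of its neighbours (both directions).
    """
    def is_low_degree(vertex):
        # Degree is measured in the original, unfiltered graph.
        return len(adjacency.get(vertex, ())) < max_degree

    # A high-degree start vertex is filtered out entirely.
    if not is_low_degree(start):
        return set()

    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in seen and is_low_degree(neighbour):
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


# Hypothetical toy graph: "hub" has 4 edges, every other vertex has fewer.
adjacency = {
    "a": ["b", "hub"],
    "b": ["a", "c"],
    "c": ["b", "hub"],
    "d": ["hub"],
    "e": ["hub"],
    "hub": ["a", "c", "d", "e"],
}

neighbourhood = k_hop_neighbourhood(adjacency, "a", k=2, max_degree=4)
print(sorted(neighbourhood))
```

With a threshold of 4 the walk from "a" cannot pass through "hub", so "d" and "e" are never reached even though they are only two hops away in the unfiltered graph. This is the behaviour I would like the Gremlin traversal to reproduce.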

Attempt 1:

filtered_edges = g.V().filter(bothE().limit(50).count().is_(lt(50))).outE().toList()
subgraph = g.E(filtered_edges).subgraph('subGraph').cap('subGraph').next()

Unfortunately, gremlinpython throws an error here (StreamClosedError: Stream is closed). Other, possibly less expensive, queries run before and after this error do not fail in the same way, so the connection to the gremlin shell is still alive. The code also works in the gremlin shell (with is_ replaced by is).

I guess this is because I'm sending so much data between the gremlin server and Python, but I'm unsure why this would be an issue.

Attempt 2:

Using the gremlin client, I tried overwriting another traversal object named l. However, the overwrite operation (l = subgraph.traversal();) fails.

gremlin_client = client.Client('ws://{}:{}/gremlin'.format('localhost', 8192), 'g', message_serializer=serializer.GraphSONSerializersV3d0())


command = "filtered_edges = g.V().filter(bothE().limit(50).count().is(lt(50))).outE().toList(); subgraph = g.E(filtered_edges).subgraph('subGraph').cap('subGraph').next(); l = subgraph.traversal();"
gremlin_client.submit(command).all().result()
Ian

1 Answer


You can either continue your traversal from there:

s.V().filter(bothE().limit(50).count().is(lt(50))).out().has(...)....

or:

List<Vertex> list = s.V().filter(bothE().limit(50).count().is(lt(50))).toList()
s.V(list).out().has(...)....
stephen mallette
  • Thanks Stephen, the only issue is that I'm looking to use the filtered graph multiple times, so don't want to recompute `s.V(list)` each time I run a query. How can I store `s.V(list)` in order to reuse it? – Ian Aug 09 '19 at 10:47
  • I'm not really sure what additional options you have. The list of vertices is disconnected from the graph so to traverse them again to gather anything connected to them on a separate traversal means looking them up again. If you are using Java, I guess you could use `subgraph()` which would return a TinkerGraph with just the stuff you care about and then you could query that subgraph. – stephen mallette Aug 09 '19 at 12:45
  • Hi Stephen I've attempted a couple of ways to solve this problem (see updates) but I'm having little luck. I'd be really grateful if you could offer some insight on where I'm going wrong – Ian Aug 09 '19 at 14:40
  • note that i said "if you are using Java" then `subgraph()` is an option. gremlinpython does not yet have support for that particular step (there is no TinkerGraph in python to subgraph to). using a script in your "Attempt 2" should work but only if you use a session and keep the TinkerGraph on the server (i.e. set to a variable and return nothing then access it on the next request). Separately, why requery all the edges with `g.E()` and do you need all the edges in your subgraph (the vertices you wanted were only those with less than 50 edges, but what specific edges do you need)? – stephen mallette Aug 09 '19 at 14:52
  • all that said, sessions aren't a great solution. i'm still not sure i understand what problem you are trying to solve here as the entire scope of what you're trying to do isn't really explained. how many additional queries are you trying to execute on the handful of vertices you initially find? are they mutations or just additional reads to get data? if they are just additional reads, what is the nature of those reads (e.g. deep traversals over the entire graph? further filtering on just those vertices by property?). i dunno...maybe you need to form a new question with more details. – stephen mallette Aug 09 '19 at 14:56
  • Hi Stephen, thanks for your insights. I've added some background to the question which hopefully contextualizes my query a little better, is there a way I can adapt the query (see above: the code to find the nodes in a k-hop neighbourhood) on the fly using a `where` condition instead of doing a full filter first? Sorry this question has taken such a tangent! – Ian Aug 09 '19 at 15:36
  • Actually, I think I've solved it. For a given vertex, `v`, first query to see if it has less than 50 edges, and if it does: `g.V(v).repeat(__.bothE().bothV().where(__.bothE().count().is(lt(50)))).times(5).dedup().toList()`. Apologies for my roundabout methodology ~ – Ian Aug 09 '19 at 15:50