Efficient controlled random walks in gremlin

Question

Goal

The objective is to efficiently generate random walks on a relatively large graph with uneven probabilities of going through edges depending on their type.

Configuration

Ubuntu VM, 23Go RAM
JanusGraph 0.6.1 full
Local graph (default conf/remote.yaml file used)
~1.8m vertices (~28k will be start nodes for the random walks)
~21m relationships (they can all be used in the random walks)

What I am doing

I am currently generating random walks with the sample command:

g.V(<startnode_id>).
  repeat( local( both().sample(1) ) ).
    times(<desired_randomwalk_length>).
  path()

What I tried

I tried using a gremlinpython script to create a random walk generator that would first get all edges connected to the current node, then pick randomly an edge to go through and repeat <desired_randomwalk_length> times.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.structure.graph import Vertex
from typing import List

connection = DriverRemoteConnection(<URL>, "g")
g = traversal().withRemote(connection)

def get_next_node(start:Vertex) -> Vertex:
  next_vertices = g.V(start.id).both().fold().next()
  return next_vertices[randint(0, len(next_vertices)-1)]
def get_random_walk(start:Vertex, length:int=10) -> List[Vertex]:
  current_node = start
  random_walk = [current_node]
  for _ in range(length):
    current_node = get_next_node(current_node)
    random_walk.append(current_node)
  return random_walk

Issues

While testing on a subset of the total graph (400k vertices, 1.5m rel), I got these results

Sample query, <desired_randomwalk_length> of 10: 100k random walks in 1h10
Gremlinpython function, <desired_randomwalk_length> of 4: 2k random walks in 1h+

The sample command is really fast, but there are a few problems:

It doesn't seem to truly be a uniform distribution pick amongst the edges (it seems to be successive coin tosses) which could lead to certain paths being taken more often, which then diminishes the interest of generating random walks. (I can't directly do what is recommended here as the nodes ids aren't in a sequence, thus I have to acquire them first.)
I haven't found a way to give different probabilities to different types of relationships.

Is there a better way to do random walks with Gremlin?
If there is none, is there a way to modify the sample query to rectify the assign probabilities to types of edges? Maybe even a way to have a better distribution of the sampling?
In last recourse, is there a way to improve the queries to make this "by hand" with a gremlinpython script?

Thanks to everyone reading/replying!

EDIT

Is there a way to do the following:

Given a r_type1, r_type2, r_type3, ... the acceptable relationship type for this random walk
Given a proba1, proba2, proba3, ... the probabilities of going through these relationship types

For each step

Sample a node for each relationship type r_type1, r_type2, r_type3, ...
Keep only one according to the probabilities proba1, proba2, proba3, ...

I think the second step could be done be sampling multiple nodes for each relationships, in accordance with the probas (which could be done by using a gremlinpython script to build the query). This still leaves the question of how to sample on multiple relationships from a single node, and how to randomly pick one in the sampled nodes.

I hope this is clear!

In general, the approach using the `sample` step is the one that I use. That's also the way it is suggested in the TinkerPop documentation. Doing that over a series of concurrent tasks should yield some reasonable walks. — Kelvin Lawrence, Jul 05 '22 at 23:49
I agree that it yields reasonable results. I would like to have more control over this still, as there are relationships that are not relevant for the random walks I'm trying to generate. Would there be a way to do the operations described in my "Edit" in the post? — GregoirePelegrin, Jul 06 '22 at 08:18
To make sure I understand the edit, you would look at all the edges that have a label from an allowed set and for each of those look at an edge property that is some kind of weighting, or would the weighting be injected into the query? — Kelvin Lawrence, Jul 06 '22 at 21:16
Initial thought: I would look at all the edges that have a label from an allowed set, sample one for each (`s1` for edge of type `label1`, `s2` for edge of type `label2`, etc... for `labelX` in allowed set). Then I would keep only one of them according to a probability distribution (50% for `label1`, 10% for `label2`, etc...) — GregoirePelegrin, Jul 07 '22 at 07:21
Then I realized that if I have the following probability distribution: 50% `label1`, 20% `label2`, 30% `label3`, I can just sample 5 edges of type `label1`, 2 edges of type `label2` and 3 edges of type `label3`, then randomly (uniformly) select one of them. Furthermore, as I am building my queries with `gremlinpython`, I can dynamically compute the number of samples I need from each type. Thus, "all" I need left is: From a given node, how to sample edges of a certain type, then another type, then another type, etc..., append the results in a list, and randomly pick an element from that list. — GregoirePelegrin, Jul 07 '22 at 07:27

score 2 · Accepted Answer · answered Aug 02 '22 at 07:19

Thanks to @Kelvin Lawrence's Practical Gremlin (especially the union section), I managed to do what I wanted (or close enough).

The Gremlin query I get is the following:

g.V(<vertex_id>).
  repeat(
    local(
      union(
        both(<relationship_type1>).sample(N1),
        both(<relationship_type2>).sample(N2),
        ...
      ).
      sample(1)
    )
  ).times(<walk_length>).
  path()

The N_ values are set independently of the node, such that the least probable transition yields exactly 1 sample. This also means that the probabilities are not exactly respected where the number of relationships of a given type is inferior to the corresponding N_ value.

The union part is built in python using gremlinpython (nb_samples is the dictionary storing the number of samples needed for each relationship type)

from gremlin_python.process.graph_traversal import __, GraphTraversal

next_node_traversal:GraphTraversal = __.union(
  *[
    __.both(key).sample(nb_samples[key])
    for key in nb_samples
  ]
).sample(1)

(Here we are using the * operator to unpack the list when passing it as argument to the union method)