Goal
The objective is to efficiently generate random walks on a relatively large graph with uneven probabilities of going through edges depending on their type.
Configuration
- Ubuntu VM, 23 GB RAM
- JanusGraph 0.6.1 full
- Local graph (default conf/remote.yaml file used)
- ~1.8m vertices (~28k will be start nodes for the random walks)
- ~21m relationships (they can all be used in the random walks)
What I am doing
I am currently generating random walks with the sample command:
g.V(<startnode_id>).
repeat( local( both().sample(1) ) ).
times(<desired_randomwalk_length>).
path()
What I tried
I tried using a gremlinpython script to create a random walk generator that first gets all the vertices adjacent to the current node, then randomly picks one to move to, and repeats this <desired_randomwalk_length> times.
from random import randint
from typing import List

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.structure.graph import Vertex

connection = DriverRemoteConnection(<URL>, "g")
g = traversal().withRemote(connection)

def get_next_node(start: Vertex) -> Vertex:
    # Fetch every neighbour of the current vertex, then pick one uniformly.
    next_vertices = g.V(start.id).both().fold().next()
    return next_vertices[randint(0, len(next_vertices) - 1)]

def get_random_walk(start: Vertex, length: int = 10) -> List[Vertex]:
    current_node = start
    random_walk = [current_node]
    for _ in range(length):
        # One round trip to the server per step of the walk.
        current_node = get_next_node(current_node)
        random_walk.append(current_node)
    return random_walk
Issues
While testing on a subset of the total graph (400k vertices, 1.5m relationships), I got these results:
- Sample query, <desired_randomwalk_length> of 10: 100k random walks in 1h10
- Gremlinpython function, <desired_randomwalk_length> of 4: 2k random walks in 1h+
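To put the two results on the same scale, here is a small sketch that converts them to traversal steps per second. It only uses the figures above and assumes "1h+" means exactly 60 minutes, so the rate for the script is an upper bound. The helper name `steps_per_second` is mine.

```python
def steps_per_second(walks: int, length: int, minutes: float) -> float:
    """Total number of walk steps divided by the elapsed time in seconds."""
    return walks * length / (minutes * 60)

# 100k walks of length 10 in 1h10 (70 minutes).
sample_rate = steps_per_second(100_000, 10, 70)
# 2k walks of length 4 in "1h+"; 60 minutes is assumed, so this is a ceiling.
script_rate = steps_per_second(2_000, 4, 60)

print(f"sample query:  {sample_rate:.0f} steps/s")
print(f"python script: {script_rate:.1f} steps/s (at most)")
print(f"ratio:         {sample_rate / script_rate:.0f}x (at least)")
```

That is roughly 238 steps per second for the sample query against about 2.2 for the script, so the sample query is at least around 100 times faster per step.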
The sample command is really fast, but there are a few problems:
- It doesn't seem to pick truly uniformly among the edges (it looks like successive coin tosses), which could cause certain paths to be taken more often and so reduce the value of generating random walks. (I can't directly do what is recommended here, as the node ids aren't in a sequence, so I would have to acquire them first.)
- I haven't found a way to give different probabilities to different types of relationships.
Is there a better way to do random walks with Gremlin?
If there is none, is there a way to modify the sample query to assign probabilities to types of edges? Maybe even a way to get a better distribution from the sampling?
As a last resort, is there a way to improve the queries to do this "by hand" with a gremlinpython script?
Thanks to everyone reading/replying!
EDIT
Is there a way to do the following:
- Given r_type1, r_type2, r_type3, ... the acceptable relationship types for this random walk
- Given proba1, proba2, proba3, ... the probabilities of going through these relationship types
For each step:
- Sample a node for each relationship type r_type1, r_type2, r_type3, ...
- Keep only one according to the probabilities proba1, proba2, proba3, ...
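To make the "keep only one" step concrete, here is a minimal client-side sketch in plain Python. It assumes the per-type candidates have already been fetched somehow; the function name `pick_next_node`, the dict layout and the example ids are all hypothetical. Types without a candidate are skipped, so the remaining probabilities are implicitly renormalised, which may or may not be the behaviour wanted.

```python
import random
from typing import Dict, Hashable, Optional

def pick_next_node(
    candidates: Dict[str, Optional[Hashable]],
    probabilities: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> Optional[Hashable]:
    """Keep one of the per-type sampled nodes according to the type weights.

    `candidates` maps a relationship type to the node sampled for it, or to
    None when the current node has no neighbour of that type.
    """
    rng = rng or random
    # Only types that produced a candidate and have a positive weight count.
    usable = [
        r_type
        for r_type, node in candidates.items()
        if node is not None and probabilities.get(r_type, 0) > 0
    ]
    if not usable:
        return None  # dead end: the walk cannot continue from here
    weights = [probabilities[r_type] for r_type in usable]
    # random.choices accepts unnormalised weights.
    chosen_type = rng.choices(usable, weights=weights, k=1)[0]
    return candidates[chosen_type]

# Hypothetical example: r_type3 has no neighbour at this node.
candidates = {"r_type1": 101, "r_type2": 202, "r_type3": None}
probabilities = {"r_type1": 0.7, "r_type2": 0.2, "r_type3": 0.1}
print(pick_next_node(candidates, probabilities, random.Random(0)))
```

This only covers the selection; it still needs one candidate per relationship type from the graph at every step.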
I think the second step could be done by sampling multiple nodes for each relationship type, in proportion to the probabilities (which could be done by using a gremlinpython script to build the query). This still leaves the question of how to sample on multiple relationship types from a single node, and how to randomly pick one of the sampled nodes.
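As a rough illustration of what I mean by building the query from a script, the sketch below generates a query string that samples several neighbours per relationship type inside a union() and then keeps one. I have not run the generated query against JanusGraph, so treat it as an assumption that union() can be combined with sample() this way. The bias is also only approximate: a type with fewer neighbours than its count contributes fewer candidates, and it still relies on sample(1) being uniform. The function name `build_walk_query` and the `counts` mapping are hypothetical.

```python
from typing import Dict

def build_walk_query(start_id: int, counts: Dict[str, int], length: int) -> str:
    """Return a Gremlin query string for a type-biased random walk.

    `counts` maps an edge label to how many neighbours to sample for it,
    e.g. {"r_type1": 3, "r_type2": 1} for an intended 3:1 bias.
    Labels are inserted as-is, so they must come from a trusted source.
    """
    if not counts or any(n < 1 for n in counts.values()):
        raise ValueError("each edge label needs a sample count of at least 1")
    # One union branch per relationship type.
    branches = ", ".join(
        f"both('{label}').sample({n})" for label, n in counts.items()
    )
    return (
        f"g.V({start_id})."
        f"repeat(local(union({branches}).sample(1)))."
        f"times({length}).path()"
    )

query = build_walk_query(42, {"r_type1": 3, "r_type2": 1}, 10)
print(query)
```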
I hope this is clear!