
I have a big (half a million edges) weighted, undirected graph and I want to find the distance between two nodes u and v. I could use my_graph.shortest_paths(u, v, weights='length') to get the distance, but this is really slow.

I can also first find the path and then calculate the length of it. This is fast, but I don't understand why this is faster than calculating the length directly.

In networkx I used nx.shortest_path_length(my_graph, u, v, weight='length')

I used this code to measure the speed. For anyone who wants to run it, I put the edge list on Google Drive here

import pandas as pd
import networkx as nx
import igraph
import time

# load edgelist
edgelist = pd.read_pickle('edgelist.pkl')

# create igraph
tuples = [tuple(x) for x in edgelist[['u', 'v', 'length']].values]
graph_igraph = igraph.Graph.TupleList(tuples, directed=False, edge_attrs=['length'])

# create nx graph
graph_nx = nx.from_pandas_edgelist(edgelist, source='u', target='v', edge_attr=True)


def distance_shortest_path(u, v):
    # shortest_paths returns a matrix (list of lists), so index twice for the scalar
    return graph_igraph.shortest_paths(u, v, weights='length')[0][0]

get_length = lambda edge: graph_igraph.es[edge]['length']
def distance_path_then_sum(u, v):
    path = graph_igraph.get_shortest_paths(u, v, weights='length', output='epath')[0]
    return sum(map(get_length, path))

def distance_nx(u, v):
    return nx.shortest_path_length(graph_nx, u, v, weight='length')


some_nodes = [
    'Delitzsch unt Bf',
    'Neustadt(Holst)Gbf',
    'Delitzsch ob Bf',
    'Karlshagen',
    'Berlin-Karlshorst (S)',
    'Köln/Bonn Flughafen',
    'Mannheim Hbf',
    'Neu-Edingen/Friedrichsfeld',
    'Ladenburg',
    'Heddesheim/Hirschberg',
    'Weinheim-Lützelsachsen',
    'Wünsdorf-Waldstadt',
    'Zossen',
    'Dabendorf',
    'Rangsdorf',
    'Dahlewitz',
    'Blankenfelde(Teltow-Fläming)',
    'Berlin-Schönefeld Flughafen',
    'Berlin Ostkreuz',
]

print('distance_shortest_path ', end='')
start = time.time()
for node in some_nodes:
    distance_shortest_path('Köln Hbf', node)
print('took', time.time() - start)

print('distance_nx ', end='')
start = time.time()
for node in some_nodes:
    distance_nx('Köln Hbf', node)
print('took', time.time() - start)

print('distance_path_then_sum ', end='')
start = time.time()
for node in some_nodes:
    distance_path_then_sum('Köln Hbf', node)
print('took', time.time() - start)

Which results in

distance_shortest_path took 46.34037733078003
distance_nx took 12.006148099899292
distance_path_then_sum took 0.9555535316467285
1 Answer

You can use the shortest_paths function in igraph for this. Using it is quite straightforward: suppose that G is your graph, with edge weights stored in G.es['weight'], then

D = G.shortest_paths(weights='weight')

will give you an igraph matrix D. You can convert this to a numpy array as

D = np.array(list(D))

To obtain the distance between only a specific pair of (sets of) nodes, you can specify the source and target arguments of shortest_paths.

Vincent Traag
  • While this does work, it is about half as fast as the networkx function I also mentioned. I also found out by now, that it is about 40 times faster to calculate the path first using `get_shortest_paths` and then iterate over the edges in the path to find the weighted length of it. – McToel Mar 19 '21 at 22:39
  • Apologies, I misread the question to mean the distance between all pairs of nodes. You can specify a `source` and `target` node in the `shortest_paths` function if you want the distance between two (sets of) nodes. I will update my answer. – Vincent Traag Mar 20 '21 at 06:27
  • The results of my speedtest were already computed using source and target. Maybe `shortest_paths` is just way slower in my case. – McToel Mar 20 '21 at 12:17
  • That is somewhat unexpected. When I do this: `G = ig.Graph.Lattice([100], 3); G.es['weight'] = np.random.random(G.ecount())` and then use `%timeit` to test `G.shortest_paths(np.random.randint(G.vcount()), np.random.randint(G.vcount()), weights='weight')` and `sum(G.es[G.get_shortest_paths(np.random.randint(G.vcount()), np.random.randint(G.vcount()), weights='weight', output='epath')[0]]['weight']) ` this yields `26.8 µs ± 540 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)` and `30.1 µs ± 692 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)` respectively. – Vincent Traag Mar 20 '21 at 13:04
  • In comparison, when running `nx.shortest_path_length(H, np.random.randint(len(H)), np.random.randint(len(H)), weight='weight')` where `H` is the `networkx` graph as converted by `igraph`, I obtain `158 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)`. – Vincent Traag Mar 20 '21 at 13:11
  • I updated the question to include the code I used for testing. Could it be that `shortest_paths` somehow has a bigger O than `get_shortest_paths`? – McToel Mar 20 '21 at 16:30