
I have a piece of code that loops through a set of nodes and counts the path length connecting the given node to each other node in my network. For each node my code returns a list, b, containing integer values giving the path length for every possible connection. I want to count the number of occurrences of each path length so I can create a histogram.

local_path_length_hist = {}
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    # Remove sentinel entries (2147483647 means there is no path)
    b = a[a!=2147483647]
    for dist in b:
        if dist in local_path_length_hist:
            local_path_length_hist[dist]+=1
        else:
            local_path_length_hist[dist]=1

This presumably is very crude coding as far as the dictionary update is concerned. Is there a better way of doing this? What is the most efficient way of creating this histogram?

P-M

4 Answers


The check that the element exists in the dict is not really necessary. You can just use collections.defaultdict. Its constructor accepts a callable (such as a function) that is invoked to generate a default value whenever you access (or assign to) a key that does not yet exist. For your case, that callable can simply be int. I.e.

import collections
local_path_length_hist = collections.defaultdict(int)
# you could say collections.defaultdict(lambda : 0) instead
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        local_path_length_hist[dist] += 1

You could even fold the last two lines of the loop into one, but there is really no point.
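To see the behavior in isolation, here is a minimal, self-contained example of counting with defaultdict(int) (toy data, not graph-tool output):

```python
from collections import defaultdict

counts = defaultdict(int)  # missing keys default to int() == 0
for dist in [1, 2, 2, 3, 3, 3]:
    counts[dist] += 1      # no existence check needed

print(dict(counts))  # {1: 1, 2: 2, 3: 3}
```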

Dmitry Torba
  • I suspect I might agree with you about there being not that much point. It looks more elegant, agreed, but would it be much more efficient? – P-M Sep 02 '16 at 10:01
  • Yes, it is just as efficient as checking whether the key exists before assigning; defaultdict does essentially that behind the scenes. And each lookup is O(1) anyway, because dicts are implemented as hash tables in Python – Dmitry Torba Sep 02 '16 at 14:49

Since gt.shortest_distance returns an ndarray, numpy math is fastest:

max_dist = len(vertices) - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1
hist = np.zeros(hist_length, dtype=int)
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    hist += np.bincount(dist.a.clip(max=no_path_dist))

I use the ndarray method clip to bin the 2147483647 values returned by gt.shortest_distance at the last position of hist. Without use of clip, hist's size would have to be 2147483647 + 1 on 64-bit Python, or bincount would produce a ValueError on 32-bit Python. So the last position of hist will contain a count of all non-paths; you can ignore this value in your histogram analysis.
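To illustrate the clip-then-bincount trick on a small array (a toy example; the graph here is pretended to have 5 vertices, so the sentinel 2147483647 gets clipped into the reserved last bin):

```python
import numpy as np

max_dist = 4                    # pretend the graph has 5 vertices
no_path_dist = max_dist + 1     # bin index reserved for "no path"
dist = np.array([1, 2, 2, 2147483647, 3, 2147483647])

clipped = dist.clip(max=no_path_dist)  # sentinels land on no_path_dist
hist = np.bincount(clipped, minlength=max_dist + 2)
print(hist)  # [0 1 2 1 0 2]; the last slot counts the two non-paths
```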


As the below timings indicate, using numpy math to obtain a histogram is well over an order of magnitude faster than using either defaultdicts or counters (Python 3.4):

# vertices      numpy    defaultdict    counter
    9000       0.83639    38.48990     33.56569
   25000       8.57003    314.24265    262.76025
   50000      26.46427   1303.50843   1111.93898

My computer is too slow to test with 9 * (10**6) vertices, but relative timings seem pretty consistent for varying number of vertices (as we would expect).


timing code:

from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
from timeit import repeat

# construct distance ndarray such that:
# a) 1/3 of values represent no path
# b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
num_vertices = 50000
no_path_length = 2147483647
distances = []
for _ in range(num_vertices):
    rand_dist = randint(0,(num_vertices-1))
    distances.append(choice((no_path_length, rand_dist, rand_dist)))
dist_a = np.array(distances)

def use_numpy_math():
    max_dist = num_vertices - 1
    hist_length = max_dist + 2
    no_path_dist = max_dist + 1
    hist = np.zeros(hist_length, dtype=int)
    for _ in range(num_vertices):
        hist += np.bincount(dist_a.clip(max=no_path_dist))

def use_default_dict():
    d = defaultdict(int)
    for _ in range(num_vertices):
        for dist in dist_a:
            d[dist] += 1

def use_counter():
    hist = Counter()
    for _ in range(num_vertices):
        hist.update(dist_a)

t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
                repeat=3, number=1))
t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
                repeat= 3, number=1))
t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
                repeat= 3, number=1))

print('%0.5f, %0.5f, %0.5f' % (t1, t2, t3))
Craig Burgler
  • If there is no path between two vertices the code returns me a path length of `2147483647`. I have to remove them as it would otherwise blow up my averages. Are you sure the last modification would work? It strikes me that the code would not know what `a` is as we haven't assigned the variable yet, or am I misunderstanding `numpy` here? (My knowledge of `numpy` is very limited.) – P-M Sep 02 '16 at 10:00
  • It may be a Python understanding, not numpy. I will expand what I did in the answer. – Craig Burgler Sep 02 '16 at 10:37
  • edited answer treating `[a!=2147483647]` as a comment – Craig Burgler Sep 02 '16 at 11:00
  • Great, thank you. Would you expect a performance gain in doing this compared to my original version? Or is this largely optics? – P-M Sep 02 '16 at 11:03
  • My intent was optics and being Pythonic but `defaultdicts` may be faster than regular `dicts` for your data. I would suggest running benchmarks with and without them to test with your data. – Craig Burgler Sep 02 '16 at 11:38
  • Roughly how many unique keys do you get with your data? What do these keys look like? I would like to run some quick benchmarks. – Craig Burgler Sep 02 '16 at 17:39
  • Hi Craig, good question (also excuse the late response). I have 9 million vertices and 5.2 million edges. So `dist.a` would contain 9 million entries. I am not sure how many of these will be invalid (i.e. equal `2147483647`) but the number of remaining values should still be large. I then repeat this process for each of the 9 million vertices. Does this help? – P-M Sep 16 '16 at 10:57
  • @P-M see my edited answer for a `numpy` approach which should be faster than those presented thus far ITT – Craig Burgler Sep 16 '16 at 13:52
  • Thank you, it returns the error `AttributeError: 'PropertyMap' object has no attribute 'clip'` though. Any thoughts why? – P-M Sep 19 '16 at 15:45
  • Problem solved. It needs to read `hist += np.bincount(dist.a.clip(max=no_path_dist))` – P-M Sep 19 '16 at 15:46
  • I can't make that edit as it apparently needs to be at least 6 characters in length. If you could carry it out then I will accept the answer. – P-M Sep 19 '16 at 16:01

The collections module provides a utility called Counter. This is even cleaner than using a defaultdict(int):

from collections import Counter
hist = Counter()
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    hist.update(b)
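Counter.update accepts any iterable, including numpy arrays, so each per-vertex batch of distances can be folded in directly (toy data for illustration):

```python
from collections import Counter
import numpy as np

hist = Counter()
hist.update(np.array([1, 2, 2, 3]))  # distances from vertex 1
hist.update(np.array([2, 3, 3]))     # distances from vertex 2

# hist now maps each path length to its total count across batches
print(hist[2], hist[3], hist[1])
```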
zstewart

I think you can bypass this code entirely. Your question is tagged with graph-tool. Take a look at this section of their documentation: graph_tool.stats.vertex_hist.

Excerpt from linked documentation:

graph_tool.stats.vertex_hist(g, deg, bins=[0, 1], float_count=True)
Return the vertex histogram of the given degree type or property.

Parameters:
g : Graph  Graph to be used.
deg : string or PropertyMap
 Degree or property to be used for the histogram. It can be either “in”, “out” or “total”, for in-,
 out-, or total degree of the vertices. It can also be a vertex property map.
bins : list of bins (optional, default: [0, 1])
 List of bins to be used for the histogram. The values given represent the edges of the bins
 (i.e. lower and upper bounds). If the list contains two values, this will be used to automatically
 create an appropriate bin range, with a constant width given by the second value, and starting
 from the first value.
float_count : bool (optional, default: True)
 If True, the counts in each histogram bin will be returned as floats. If False, they will be
 returned as integers.

Returns: counts : ndarray
 The bin counts.
bins : ndarray
 The bin edges.

This returns the bin counts and bin edges as ndarrays, so you get the counts for your histogram directly, without looping over vertices yourself.
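If memory forces you to process the network vertex by vertex anyway, the per-vertex histograms can be accumulated with plain numpy; a minimal sketch using np.histogram with explicit integer bin edges (toy distance arrays standing in for what each dist.a would hold):

```python
import numpy as np

# toy per-vertex distance arrays (stand-ins for dist.a)
per_vertex = [np.array([0, 1, 2, 2]), np.array([1, 0, 1, 3])]

edges = np.arange(0, 5)              # bins [0,1), [1,2), [2,3), [3,4]
total = np.zeros(len(edges) - 1, dtype=int)
for a in per_vertex:
    counts, _ = np.histogram(a, bins=edges)
    total += counts                  # accumulate one histogram per vertex

print(total)  # [2 3 2 1]
```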

skrrgwasme
  • That would be the ideal way, I agree. The problem I have though is that my network is too large to analyse it in one go - I run out of RAM - so I have to analyse it vertex by vertex. I could then get the vertex_hist for each vertex but I would still need to store it in a temporary variable to add the histogram for the other vertices. I am not sure if this would lead to much of a speedup? – P-M Sep 02 '16 at 09:56