
I'm trying to evaluate the probabilities of the end locations of random walks, but I'm having some trouble with the speed of my program. Basically, I take as input a dictionary that contains the probabilities for a random walk (e.g. p = {0: 0.5, 1: 0.2, -1: 0.3}, meaning there's a 50% probability X stays at 0, a 20% probability X increases by 1, and a 30% probability X decreases by 1) and then calculate the probabilities of all the possible future states after n iterations.

So, for example, if p = {0: 0.5, 1: 0.2, -1: 0.3} and n = 2, it will return {0: 0.37, 1: 0.2, -1: 0.3, 2: 0.04, -2: 0.09}; if p = {0: 0.5, 1: 0.2, -1: 0.3} and n = 1, it will return {0: 0.5, 1: 0.2, -1: 0.3}.

I have working code, and it runs relatively quickly if n is low and the p dictionary is small, but when n > 500 and the dictionary has around 50 values it takes upwards of 5 minutes to calculate. I'm guessing this is because it runs on only one processor, so I went ahead and modified it to use Python's multiprocessing module (as I read that multithreading doesn't improve parallel computing performance because of the GIL).

My problem is that there is not much improvement with multiprocessing, and I'm not sure whether that's because I'm implementing it wrong or because of the overhead of multiprocessing in Python. I'm just wondering if there's a library somewhere that evaluates all the probabilities of all the possibilities of a random walk in parallel when n > 500? My next step, if I can't find anything, is to write my own function as a C extension, but it will be my first time doing that, and although I've coded in C before, it has been a while.

Original non-multiprocessed code

def random_walk_predictor(probabilities_tree, period):
    ret = probabilities_tree
    probabilities_leaves = ret.copy()
    for x in range(period):
        tmp = {}
        for leaf in ret.keys():
            for tree_leaf in probabilities_leaves.keys():
                # Accumulate the probability of landing on leaf + tree_leaf;
                # dict.get avoids the bare except of the original version.
                tmp[leaf + tree_leaf] = tmp.get(leaf + tree_leaf, 0) + ret[leaf] * probabilities_leaves[tree_leaf]
        ret = tmp
    return ret

Multiprocessed code

from multiprocessing import Manager,Pool
from functools import partial

def probability_calculator(origin, probability, outp, reference):
    for leaf in probability.keys():
        try:
            outp[origin + leaf] = outp[origin + leaf] + (reference[origin] * probability[leaf])
        except KeyError:
            outp[origin + leaf] = reference[origin] * probability[leaf]

def random_walk_predictor(probabilities_leaves, period):
    probabilities_leaves = tree_developer(probabilities_leaves)
    manager = Manager()
    prob_leaves = manager.dict(probabilities_leaves)
    ret = manager.dict({0:1})
    p = Pool()

    for x in range(period):
        out = manager.dict()
        partial_probability_calculator = partial(probability_calculator, probability = prob_leaves, outp = out, reference = ret.copy())

        p.map(partial_probability_calculator, ret.keys())
        ret = out

    return ret.copy()
Luis F Hernandez

1 Answer


There tend to be analytic solutions that exactly solve this kind of problem, looking similar to binomial distributions, but I'll assume you're really asking for a computational solution to a more general class of problem.

Rather than using Python dictionaries, it's easier to think about this in terms of the underlying mathematical problem. Build a matrix A that describes the probability of going from one state to another, and a state vector x that describes the probability of being at a given location at some time.

Because after n transitions you can be at most n steps from the origin (in either direction), your state vector needs 2n+1 rows, and A needs to be square, of size 2n+1 by 2n+1.

For a two timestep problem your transition matrix will be 5x5 and look like:

[[ 0.5  0.2  0.   0.   0. ]
 [ 0.3  0.5  0.2  0.   0. ]
 [ 0.   0.3  0.5  0.2  0. ]
 [ 0.   0.   0.3  0.5  0.2]
 [ 0.   0.   0.   0.3  0.5]]

And your state at time 0 will be:

[[ 0.]
 [ 0.]
 [ 1.]
 [ 0.]
 [ 0.]]

The one-step evolution of the system can be predicted by multiplying A and x.

So at t = 1,

 x.T = [[ 0.   0.2  0.5  0.3  0. ]]

and at t = 2,

x.T = [[ 0.04  0.2   0.37  0.3   0.09]]
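To make the evolution concrete, here is a minimal dense NumPy sketch of those two multiplications (the 5x5 matrix and starting state are taken directly from above):

```python
import numpy as np

# The 5x5 two-timestep transition matrix shown above.
A = np.array([[0.5, 0.2, 0.0, 0.0, 0.0],
              [0.3, 0.5, 0.2, 0.0, 0.0],
              [0.0, 0.3, 0.5, 0.2, 0.0],
              [0.0, 0.0, 0.3, 0.5, 0.2],
              [0.0, 0.0, 0.0, 0.3, 0.5]])

x = np.zeros(5)
x[2] = 1.0          # all of the probability mass starts at the origin

x1 = A @ x          # state at t = 1
x2 = A @ x1         # state at t = 2
print(x1)           # [0.   0.2  0.5  0.3  0. ]
print(x2)           # [0.04 0.2  0.37 0.3  0.09]
```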

For even modest numbers of timesteps this is potentially going to take a fair bit of storage (a dense A requires O(n^2) storage), but the matrix is very sparse, so we can use sparse matrices to reduce the storage (and speed up our calculations). Stored sparsely, A needs only about three non-zero elements per row, roughly 3(2n+1) in total.

import scipy.sparse as sp
import numpy as np

def random_walk_transition_probability(n, left=0.3, centre=0.5, right=0.2):
    m = 2*n + 1
    # Tridiagonal transition matrix: the sub-diagonal holds the step-left
    # probability, the main diagonal the stay probability, and the
    # super-diagonal the step-right probability.
    A = sp.diags([left*np.ones(m - 1), centre*np.ones(m), right*np.ones(m - 1)],
                 [-1, 0, 1], format='csr')
    x = np.zeros((m, 1))
    x[n] = 1.0            # all probability mass starts at the origin
    for i in range(n):
        x = A.dot(x)
    return x

print(random_walk_transition_probability(4))

Timings

%timeit random_walk_transition_probability(500)
100 loops, best of 3: 7.12 ms per loop

%timeit random_walk_transition_probability(10000)
1 loops, best of 3: 1.06 s per loop
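The question's real dictionary has around 50 non-contiguous step sizes rather than just ±1, but the same construction generalises: place each step's probability on the corresponding diagonal and size the matrix by the largest step. A sketch (the function name and the index-to-position mapping are my own, following the diagonal convention of the code above):

```python
import numpy as np
import scipy.sparse as sp

def random_walk_from_dict(p, n):
    # p maps step size -> probability (a hypothetical generalisation of
    # the fixed left/centre/right arguments used above).
    max_step = max(abs(step) for step in p)
    m = 2 * n * max_step + 1          # room for n steps of size max_step
    centre = m // 2
    A = sp.csr_matrix((m, m))
    for step, prob in p.items():
        # Each step size becomes one (off-)diagonal of the transition matrix.
        A = A + sp.diags(prob * np.ones(m - abs(step)), step)
    x = np.zeros((m, 1))
    x[centre] = 1.0                   # start at the origin
    for _ in range(n):
        x = A.dot(x)
    # Translate vector indices back to walk positions; with this diagonal
    # convention the index decreases as the walk moves in the +1 direction.
    return {centre - i: float(v) for i, v in enumerate(x.ravel()) if v > 0}

print(random_walk_from_dict({0: 0.5, 1: 0.2, -1: 0.3}, 2))
# {2: 0.04, 1: 0.2, 0: 0.37, -1: 0.3, -2: 0.09} (up to float rounding)
```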
Andrew Walker
  • What if my p dict has more than just 0, 1, -1? For example, in the dataset I use for testing, the p dict looks like [this](https://gist.githubusercontent.com/colmex/e8f0c1741f957832a1e5/raw/5167d5c15b55f39e7083c5434550765572649dbb/gistfile1.txt), as in it has around 50 values that aren't contiguous. Would it just be a matter of iterating through them and doing A += sp.diags(p*np.ones(m-1), q), where p is the probability and q is the value? So something like [this](https://gist.githubusercontent.com/colmex/c559e6d94f5e43efbd15/raw/970b3c5114c116e3b78d2d363d1c7185a78367f4/gistfile1.txt) — make sense? – Luis F Hernandez Jul 18 '15 at 02:47
  • I modified it, because I did something wrong, but I just tested the modified version — [this](https://gist.githubusercontent.com/colmex/c559e6d94f5e43efbd15/raw/e59b15b200b23b8d70f9b57c41297088428d5c2f/gistfile1.txt) is the modified code, and it works very quickly! This is really good, thank you so much! I'm just curious as to why it's so much faster than the code I had, because they essentially do the same operations: multiplying each leaf with all the probabilities and summing the result. Is it because dictionary lookups are slow, and since I'm doing a lot of them over and over it adds up? – Luis F Hernandez Jul 18 '15 at 06:10
  • The performance issues arose for a few reasons — Python dictionaries are associative containers with amortized O(1) lookup, which at first glance looks similar to an array's O(1) lookup. The difference is that a dictionary lookup (which occurs on every read/write to a slot) needs to hash the key before the read or write can be performed, which makes it slower than a container with direct addressing. This would be true in any language (i.e. the data structure wasn't the right choice). – Andrew Walker Jul 18 '15 at 07:55
  • The second part is that scipy is partially implemented in C and C++. In this case, the raw operations may or may not be implemented in C (scipy.sparse is a hybrid library and it's a bit hard to tell from the outside what's happening). However, from past experience I trust that the scipy developers have optimised this code as much as they can, and that it will require the fewest possible operations (be they in python or C++) to solve my problem. Trust your core library developers in your problem domain area. – Andrew Walker Jul 18 '15 at 08:12
  • Finally, matrix multiplications both dense and sparse are [very well understood](https://en.wikipedia.org/wiki/Matrix_multiplication#Algorithms_for_efficient_matrix_multiplication). Apart from those tricks, matrix multiplies are just additions and multiplications both of which are very fast and require few branches (ie/ this relates to both mathematics and processor architecture) – Andrew Walker Jul 18 '15 at 08:17
  • Irrespective of the performance implications, rephrasing the problem makes it easier to reason about, not to mention easier to extend/modify the algorithm used to find a solution - ie/ development time is more expensive than computing time. Consider looking at the sample problems on [Project Euler](https://projecteuler.net/) if you want to practice these types of skills. – Andrew Walker Jul 18 '15 at 08:37
  • Awesome! Thanks for the explanation and the answer! – Luis F Hernandez Jul 18 '15 at 15:22
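To illustrate the lookup-cost point from the comments, here is a hypothetical side-by-side of one convolution step done with a dict (hashing a key on every update) versus a NumPy array (direct addressing via slices); both produce the same distribution, but the array version does no hashing and vectorises the inner loop. The function names and index-to-position convention are my own:

```python
import numpy as np

steps = {0: 0.5, 1: 0.2, -1: 0.3}

def step_dict(state):
    # state maps position -> probability; every update hashes a key.
    out = {}
    for pos, p1 in state.items():
        for s, p2 in steps.items():
            out[pos + s] = out.get(pos + s, 0.0) + p1 * p2
    return out

def step_array(state):
    # Direct addressing: growing the array by one slot on each side makes
    # room for the +/-1 steps, and each step is a vectorised shifted add.
    out = np.zeros(len(state) + 2)
    for s, p in steps.items():
        out[1 + s:1 + s + len(state)] += p * state
    return out

d = step_dict(step_dict({0: 1.0}))
a = step_array(step_array(np.array([1.0])))
# Index 2 of the length-5 array corresponds to position 0.
print(d[0], a[2])   # both approximately 0.37
```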