I am trying to run a function that is similar to Google's PageRank algorithm (for non-commercial purposes, of course). Here is the Python code; note that a[0] is the only thing that matters here, and a[0] contains an n x n matrix such as [[0,1,1],[1,0,1],[1,1,0]]. Also, you can find where I got this code from on Wikipedia:

import copy
import math

def GetNodeRanks(a):        # graph, names, size
    numIterations = 10
    adjacencyMatrix = copy.deepcopy(a[0])
    b = [1]*len(adjacencyMatrix)
    tmp = [0]*len(adjacencyMatrix)
    for i in range(numIterations):
        for j in range(len(adjacencyMatrix)):
            tmp[j] = 0
            for k in range(len(adjacencyMatrix)):
                tmp[j] = tmp[j] + adjacencyMatrix[j][k] * b[k]
        norm_sq = 0
        for j in range(len(adjacencyMatrix)):
            norm_sq = norm_sq + tmp[j]*tmp[j]
        norm = math.sqrt(norm_sq)
        for j in range(len(b)):
            b[j] = tmp[j] / norm
    print b
    return b 

When I run this implementation (on a matrix much larger than a 3 x 3 matrix, n.b.), it does not yield enough precision to calculate the ranks in a way that allows me to compare them usefully. So I tried this instead:

import copy
from decimal import *

getcontext().prec = 5

def GetNodeRanks(a):        # graph, names, size
    numIterations = 10
    adjacencyMatrix = copy.deepcopy(a[0])
    b = [Decimal(1)]*len(adjacencyMatrix)
    tmp = [Decimal(0)]*len(adjacencyMatrix)
    for i in range(numIterations):
        for j in range(len(adjacencyMatrix)):
            tmp[j] = Decimal(0)
            for k in range(len(adjacencyMatrix)):
                tmp[j] = Decimal(tmp[j] + adjacencyMatrix[j][k] * b[k])
        norm_sq = Decimal(0)
        for j in range(len(adjacencyMatrix)):
            norm_sq = Decimal(norm_sq + tmp[j]*tmp[j])
        norm = Decimal(norm_sq).sqrt
        for j in range(len(b)):
            b[j] = Decimal(tmp[j] / norm)
    print b
    return b 

Even at this unhelpfully low precision, the code was extremely slow and never finished running in the time I sat waiting for it to run. Previously, the code was quick but insufficiently precise.

Is there a sensible/easy way to make the code run quickly and precisely at the same time?

emesday
  • 1
    What is in `a`? It's going to be basically impossible to optimize your code since you gave no expected inputs or expected outputs. – Two-Bit Alchemist Apr 16 '14 at 23:01
  • a[0] is the only thing that I'm operating on; it holds an n x n adjacency matrix. –  Apr 16 '14 at 23:02
  • For example, a[0] might hold: [[0,1,1],[1,0,1],[1,1,0]] –  Apr 16 '14 at 23:02
  • Edit that into your question as example input. Is it a normal list of lists or something created with a library like `numpy`? – Two-Bit Alchemist Apr 16 '14 at 23:03
  • It's a normal list of lists. I thought of using numpy; might that help? (Will edit in a moment.) –  Apr 16 '14 at 23:04
  • It might. I haven't personally used it. Also, please tag your question either `python-2.7` or `python-3.x` and say precisely which version of Python you are using. – Two-Bit Alchemist Apr 16 '14 at 23:05
  • One error I see is that you're repeatedly computing the square root for your `norm`, even though you only use the last value. Unindent that line by one level, so you only do it after the sum `norm_sq` has been fully computed. – Blckknght Apr 16 '14 at 23:06
  • That was a typo in my copy and pasting; I had it the way it is now in my original code in IDLE. –  Apr 16 '14 at 23:07
  • Reading about what you're doing on the Wikipedia page you linked, you really should be using numpy. For example, it already knows what a [matrix](http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html) is and can take a [dot product](http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html) of a matrix and vector. – Two-Bit Alchemist Apr 16 '14 at 23:19
  • I'll try numpy and come back shortly, thanks for the advice. –  Apr 16 '14 at 23:23

1 Answer

A few tips for speeding up:

  • optimize the code inside loops first
  • move everything you can out of the inner loop
  • do not recompute what is already known; store it in a variable
  • skip work that is not necessary
  • consider using list comprehensions; they are often a bit faster than an explicit loop
  • stop optimizing as soon as the speed is acceptable

Walking through your code:

import copy
from decimal import *

getcontext().prec = 5

def GetNodeRanks(a):        # graph, names, size
    # opt: pass a[0] in directly; the rest of a is not used
    numIterations = 10
    adjacencyMatrix = copy.deepcopy(a[0])
    # opt: why copy.deepcopy? You do not modify adjacencyMatrix
    b = [Decimal(1)]*len(adjacencyMatrix)
    # opt: you call Decimal(1) and Decimal(0) often, and each call takes time;
    # do it only once, like
    # dec_zero = Decimal(0)
    # dec_one = Decimal(1)
    # also prepare other repeatedly used data structures
    # len_adjacencyMatrix = len(adjacencyMatrix)
    # adjacencyMatrix_range = range(len_adjacencyMatrix)
    # Replace the code with these pre-calculated variables yourself

    tmp = [Decimal(0)]*len(adjacencyMatrix)
    for i in range(numIterations):
        for j in range(len(adjacencyMatrix)):
            tmp[j] = Decimal(0)
            for k in range(len(adjacencyMatrix)):
                tmp[j] = Decimal(tmp[j] + adjacencyMatrix[j][k] * b[k])
        norm_sq = Decimal(0)
        for j in range(len(adjacencyMatrix)):
            norm_sq = Decimal(norm_sq + tmp[j]*tmp[j])
        norm = Decimal(norm_sq).sqrt  # is this correct? I would expect .sqrt(); without the parentheses norm is the method itself, not a number
        for j in range(len(b)):
            b[j] = Decimal(tmp[j] / norm)
    print b
    return b 
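The hoisting suggested in the comments above could look roughly like the sketch below. It covers only the matrix-vector step; the helper name `mat_vec` and its parameter names are hypothetical and do not appear in the original code, and the matrix entries are assumed to be integers (multiplying a `float` by a `Decimal` raises a `TypeError`).

```python
from decimal import Decimal, getcontext

getcontext().prec = 5  # same low precision as in the question


def mat_vec(matrix, vec):
    """Multiply a square list-of-lists matrix by a list of Decimals.

    Hypothetical helper: it only illustrates moving repeated work
    out of the loops.
    """
    n = len(matrix)          # computed once instead of on every loop pass
    indices = range(n)       # reused by the outer and the inner loop
    dec_zero = Decimal(0)    # created once instead of once per row
    result = [dec_zero] * n
    for j in indices:
        row = matrix[j]      # row lookup hoisted out of the inner loop
        acc = dec_zero
        for k in indices:
            acc += row[k] * vec[k]
        result[j] = acc
    return result
```

Calling `mat_vec([[0,1,1],[1,0,1],[1,1,0]], [Decimal(1)]*3)` gives a list of three `Decimal` values, one per row of the matrix.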

Now a few examples of how list processing can be optimized in Python.

Using sum, change:

        norm_sq = Decimal(0)
        for j in range(len(adjacencyMatrix)):
            norm_sq = Decimal(norm_sq + tmp[j]*tmp[j])

to:

        norm_sq = sum(val*val for val in tmp)
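As a quick sanity check, the small sketch below compares the explicit loop, the `sum` version, and the corrected `reduce` form mentioned in the comments on this answer. The sample values in `tmp` are made up for illustration.

```python
from decimal import Decimal
from functools import reduce  # also a builtin on Python 2

tmp = [Decimal(3), Decimal(4)]  # illustrative values only

# Explicit loop, as in the original code.
norm_sq_loop = Decimal(0)
for val in tmp:
    norm_sq_loop += val * val

# sum() starts from the int 0; int + Decimal yields a Decimal.
norm_sq_sum = sum(val * val for val in tmp)

# reduce() must square only the new item, not the running total.
norm_sq_reduce = reduce(lambda acc, val: acc + val * val, tmp, Decimal(0))

# sqrt is a method on Decimal, so it needs the parentheses.
norm = norm_sq_sum.sqrt()
```

All three variables hold the same sum of squares. One caveat: for an empty list `sum` returns the plain integer `0`, which has no `sqrt` method.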

A bit of list comprehension:

Change:

        for j in range(len(b)):
            b[j] = Decimal(tmp[j] / norm)

to:

        b = [Decimal(tmp_itm / norm) for tmp_itm in tmp]

Once you are comfortable with this coding style, you will be able to optimize the initial loops too, and you will probably find that some of the pre-calculated variables are no longer needed.
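Putting these pieces together, one possible rewrite of the whole function is sketched below. This is not the asker's final code: the name `get_node_ranks` and the `num_iterations` parameter are hypothetical, the matrix is passed in directly instead of as `a[0]`, and its entries are assumed to be integers. It also assumes the matrix is not all zeros, since the norm would then be zero and the division would fail.

```python
from decimal import Decimal, getcontext

getcontext().prec = 28  # the decimal module's default; the question used 5


def get_node_ranks(adjacency_matrix, num_iterations=10):
    """Power iteration on a square list-of-lists matrix (hypothetical rewrite)."""
    b = [Decimal(1)] * len(adjacency_matrix)
    for _ in range(num_iterations):
        # Matrix-vector product: one dot product per row.
        tmp = [sum(a_jk * b_k for a_jk, b_k in zip(row, b))
               for row in adjacency_matrix]
        # Euclidean norm of the product.
        norm = sum(val * val for val in tmp).sqrt()
        # Normalize so the vector keeps unit length.
        b = [val / norm for val in tmp]
    return b


ranks = get_node_ranks([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
```

No `copy.deepcopy` is needed because the matrix is never modified, and no extra `Decimal(...)` wrappers are needed because arithmetic between an `int` and a `Decimal` already returns a `Decimal`.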

Jan Vlcinsky
  • That sped it up a lot! Thanks. Now my problem is an overflow error in the "reduce" line of code. I'll see if I can't figure that out. –  Apr 17 '14 at 21:05
  • Are you sure the reduce code is correct? It could have been my imagination, but when I tried the code it looked like it was giving me a different result for the eigenvector. –  Apr 17 '14 at 21:23
  • @PhilipWhite You are probably right. I think the reduce code should read `norm_sq = reduce(lambda a, b: a+b*b, tmp, Decimal(0))`; otherwise it squares the running sum every time. Try it and, if it works, correct it in my answer. – Jan Vlcinsky Apr 17 '14 at 21:25
  • @PhilipWhite When you are done, consider adding your final code into end of your question. – Jan Vlcinsky Apr 17 '14 at 21:27
  • Seems to be working much better! I'll edit your answer and accept momentarily. Thanks! –  Apr 18 '14 at 00:45
  • Can't edit--it says edits must be at least 6 characters. –  Apr 18 '14 at 00:46
  • @PhilipWhite I made the edit. The trick is to find a few more things to change whenever you need to get past the "at least 6 characters" limit. If you have the code running and are willing to share it, consider appending it to the end of your question under a header like "Edit: final solution" or something in that style. But feel free to skip this idea. – Jan Vlcinsky Apr 18 '14 at 01:15
  • 1
    You can use `sum` rather than `reduce`, and let Python take care of the adding for you: `norm_sq = sum(tmp[j]*tmp[j] for j in range(len(adjacencyMatrix)))`. You don't even need to start it off with a `Decimal` instance, since `0+Decimal(whatever)` will be a Decimal. There are lots of other places where you're calling `Decimal` where you probably don't need to. If either argument is already a `Decimal`, you can just do operations on it and the result will be a `Decimal` too. – Blckknght Apr 18 '14 at 01:46
  • @Blckknght good point with sum - I will add it to my example. – Jan Vlcinsky Apr 18 '14 at 07:47