Hi, I'm trying to map an array of numbers to their ranks. For example, [2,5,3] would become [0,2,1].

I'm currently using np.where to look up the rank in an array, but this takes a very long time because I have to do it for a very large array (over 2 million data points).

If anyone has any suggestions on how I could achieve this, I'd greatly appreciate it!

[EDIT] This is what the code to change a specific row currently looks like:

def change_nodes(row): 
  a = row
  new_a = node_map[node_map[:,1] == a][0][0]
  return new_a

[EDIT 2] Duplicated numbers should additionally have the same rank

[EDIT 3] Additionally, each distinct number should count only once towards the ranking. So, for example, the ranks for the list [2,3,3,4,5,7,7,7,7,8,1] would be:

{1:0, 2:1, 3:2, 4:3, 5:4, 7:5, 8:6 }

chris
  • Have you seen `list.sort()` and `list.index()`? – StardustGogeta May 01 '16 at 19:53
  • thanks, np.argsort was exactly what I needed! – chris May 01 '16 at 20:03
  • sorry, I also meant to add that if a number is repeated in the list, it needs to have the same rank each time. – chris May 01 '16 at 20:14
  • See my solution, it works in those cases as well. – StardustGogeta May 01 '16 at 20:15
  • Actually `np.argsort()` doesn't do what OP is really asking. The [doc](http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) says it returns the indexes that would sort the array, not the array of the ranks of the array's elements. Here the `[2,5,3]` example happens to give the same result, `[0, 2, 1]`, but if the example had been `[5,2,3]`, `np.argsort()` would return `[1,2,0]` instead of the array of ranks `[2,0,1]`. @StardustGogeta's answer is the correct one in this instance. – ursan May 01 '16 at 21:11

3 Answers

What you want to use is numpy.argsort:

>>> import numpy as np
>>> x = np.array([2, 5, 3])
>>> x.argsort()
array([0, 2, 1])

See this question and its answers for thoughts on adjusting how ties are handled.
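
Note that `argsort` returns the indices that would sort the array, which coincides with the ranks for `[2, 5, 3]` but not in general. Below is a minimal sketch of turning that sort order into ranks. The function name `ordinal_ranks` is made up for illustration, and in this version ties get distinct consecutive ranks rather than a shared one.

```python
import numpy as np

def ordinal_ranks(x):
    """Rank of each element of a 1-D array (0 = smallest); hypothetical helper."""
    x = np.asarray(x)
    # order[k] is the index of the k-th smallest element.
    order = np.argsort(x, kind="stable")
    # Invert the permutation: the element at index order[k] has rank k.
    ranks = np.empty(len(x), dtype=int)
    ranks[order] = np.arange(len(x))
    return ranks

print(ordinal_ranks([2, 5, 3]))  # [0 2 1]
print(ordinal_ranks([5, 2, 3]))  # [2 0 1]
```

The second call is the case where plain `argsort` (which gives `[1, 2, 0]`) and the ranks differ.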

abcd
  • Shouldn't line 3 be `x = np.argsort(x)`? – StardustGogeta May 01 '16 at 20:31
  • @StardustGogeta not for my purposes, no. `x.argsort()` is the same as `np.argsort(x)`. and i didn't want to replace `x` with the sorted arguments. i just wanted to display the sorted arguments to the screen, to show that the answer is correct. i'd imagine the user of this answer would want to do something like `ranks = x.argsort()`. – abcd May 01 '16 at 20:33
  • Okay, I see what you mean. – StardustGogeta May 01 '16 at 20:34

I have a variant with only vanilla Python:

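a = [2, 5, 3]
# Sort a copy so the original order is preserved.
aSORT = sorted(a)
# Build a new list rather than overwriting a in place: ranks written into a
# can collide with values still waiting to be looked up (e.g. [-1, 1, 0]).
ranks = [aSORT.index(x) for x in a]
print(ranks)  # [0, 2, 1]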

In my testing, the numpy version posted here took 0.1406 seconds to sort the list [2,5,3,62,5,2,5,1000,100,-1,-9] compared to only 0.0154 seconds with my method.
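
For the requirement added in edit 3 of the question (duplicates share a rank and leave no gaps), a vanilla-Python sketch could use a dictionary built from the sorted distinct values. The name `dense_ranks` is only illustrative, and this is one possible approach rather than the method timed above.

```python
def dense_ranks(values):
    """Map each value to its rank among the distinct values (hypothetical helper)."""
    # Sort the distinct values once, then number them 0, 1, 2, ...
    rank_of = {v: i for i, v in enumerate(sorted(set(values)))}
    # Each lookup is an average constant-time dict access, not a list.index scan.
    return [rank_of[v] for v in values]

print(dense_ranks([2, 3, 3, 4, 5, 7, 7, 7, 7, 8, 1]))
# [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 0]
```

The intermediate dictionary is the same mapping as in the question, {1:0, 2:1, 3:2, 4:3, 5:4, 7:5, 8:6}.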

StardustGogeta

Here is an efficient solution and a comparison with the solution using index (the index solution is also incorrect under the restriction added in edit 3 of the question).

import numpy as np

def rank1(x):
    # Sort values i = 0, 1, 2, .. using x[i] as key
    y = sorted(range(len(x)), key = lambda i: x[i])
    # Map each value of x to a rank. If a value is already associated with a
    # rank, the rank is updated. Iterate in reversed order so we get the
    # smallest rank for each value.
    rank = { x[y[i]]: i for i in range(len(y) - 1, -1, -1) }
    # Remove gaps in the ranks
    kv = sorted(rank.items(), key = lambda p: p[1])
    for i in range(len(kv)):
        kv[i] = (kv[i][0], i)
    rank = { p[0]: p[1] for p in kv }
    # Pre allocate a array to fill with ranks
    r = np.zeros((len(x),), dtype=int)
    for i, v in enumerate(x):
        r[i] = rank[v]
    return r

def rank2(x):
    x_sorted = sorted(x)
    # creates a new list to preserve x
    rank = list(x)
    for v in x_sorted:
        rank[rank.index(v)] = x_sorted.index(v)
    return rank

Comparison results

>>> import random
>>> d = np.arange(1000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
100 loops, best of 3: 1.97 ms per loop
>>> %timeit rank2(d)
1 loops, best of 3: 226 ms per loop

>>> d = np.arange(10000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
10 loops, best of 3: 32 ms per loop
>>> %timeit rank2(d)
1 loops, best of 3: 24.4 s per loop

>>> d = np.arange(100000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
1 loops, best of 3: 433 ms per loop

>>> d = np.arange(2000000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
1 loops, best of 3: 11.2 s per loop

The problem with the index solution is that its time complexity is O(n^2): each call to index is a linear scan, and it is called once per element. The time complexity of my solution is O(n lg n), that is, the sort time.
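
As a possible vectorized alternative with the same sort-dominated cost, the gap-free ranks can also be read off `np.unique` with `return_inverse=True`. This is a sketch, not the code timed above: the name `rank_unique` is hypothetical and a 1-D input is assumed.

```python
import numpy as np

def rank_unique(x):
    """Dense ranks via np.unique; hypothetical name, assumes a 1-D input."""
    # np.unique sorts the distinct values; with return_inverse=True it also
    # returns, for each element of x, its index into that sorted array,
    # which is exactly the gap-free rank.
    _, inverse = np.unique(np.asarray(x), return_inverse=True)
    return inverse

print(rank_unique([2, 3, 3, 4, 1]))  # [1 2 2 3 0]
```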

malbarbo
  • wait, rank1 actually just returns the original list? – chris May 01 '16 at 23:32
  • Sorry, that was a typo in copying the code. I fixed it. – malbarbo May 01 '16 at 23:44
  • also I meant to get the rank such that multiple copies of one rank do not affect the next ranking. So [2,3,3,4,1] would return [1,2,2,3,0] rather than [1,2,2,4,0] as this code does. Do you know how I could adapt it to make it return [1,2,2,3,0] ? – chris May 01 '16 at 23:49
  • Edit your answer, put an example, I will try to change the code. – malbarbo May 02 '16 at 00:00