3

I'm trying to obtain the ranks in a 2D array, along axis=1, with no repeated ranks.

Suppose I have the array below:

array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])

I would expect the following result, for a 'hi-lo' rank:

array([[ 2.,  1.,  3.,  4.],
       [ 3., nan,  1.,  2.],
       [nan,  1.,  2.,  3.]])

And the following result, for a 'lo-hi' rank:

array([[ 2.,  4.,  3.,  1.],
       [ 1., nan,  3.,  2.],
       [nan,  3.,  1.,  2.]])

I've been using scipy.stats.rankdata but this solution is very time consuming (for large arrays). Also, the code I'm using (shown below) relies on np.apply_along_axis, which I know is not very efficient. I know that scipy.stats.rankdata accepts an axis argument but the code behind it uses exactly np.apply_along_axis (See here).

def f(array, order='hi-lo'):
    array = np.asarray(array)
    lo_hi_rank = np.apply_along_axis(rankdata, 1, array, 'ordinal')
    lo_hi_rank = lo_hi_rank.astype(float)
    lo_hi_rank[np.isnan(array)] = np.NaN
    if order == 'lo-hi':
        return lo_hi_rank
    else:
        return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1

Does anyone know a faster implementation?


Update

I've compared the execution time of all the options suggested so far.

Option 1 below is the explicit loop version of the code I suggested above (repeated below as Option 2)

def option1(a, order='ascending'):
    ranks = np.empty_like(a)
    for row in range(ranks.shape[0]):
        lo_hi_rank = rankdata(a[row], method='ordinal')
        lo_hi_rank = lo_hi_rank.astype(float)
        lo_hi_rank[np.isnan(a[row])] = np.NaN
        if order == 'ascending':
            ranks[row] = lo_hi_rank.copy()
        else:
            ranks[row] = np.nanmax(lo_hi_rank) - lo_hi_rank + 1
    return ranks

def option2(a, order='ascending'):
    a = np.asarray(a)
    lo_hi_rank = np.apply_along_axis(rankdata, 1, a, 'ordinal')
    lo_hi_rank = lo_hi_rank.astype(float)
    lo_hi_rank[np.isnan(a)] = np.NaN
    if order == 'ascending':
        return lo_hi_rank
    else:
        return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1

Options 3-6 were suggested by Divakar:

def option3(a, order='ascending'):
    na = np.isnan(a)
    sm = na.sum(1,keepdims=True)
    
    if order=='descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf,a)
    
    out = b.argsort(1,'stable').argsort(1)+1. - sm
    out[out<=0] = np.nan
    return out

def option4(a, order='ascending'):
    na = np.isnan(a)
    sm = na.sum(1,keepdims=True)

    if order=='descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf,a)

    idx = b.argsort(1,'stable')
    m,n = idx.shape
    sidx = np.empty((m,n), dtype=float)
    np.put_along_axis(sidx, idx,np.arange(1,n+1), axis=1)
    
    out = sidx - sm
    out[out<=0] = np.nan
    return out

def option5(a, order='descending'):
    b = -a if order=='descending' else a        
    out = b.argsort(1,'stable').argsort(1)+1.
    return np.where(np.isnan(a), np.nan, out)

def option6(a, order='descending'):
    b = -a if order=='descending' else a        
    idx = b.argsort(1,'stable')
    m,n = idx.shape
    out = np.empty((m,n), dtype=float)
    np.put_along_axis(out, idx,np.arange(1,n+1), axis=1)
    return np.where(np.isnan(a), np.nan, out)

Option 6 seems to be the cleanest and is indeed the fastest (~40% improvement vs Option 2). See below the average execution time for 100 iterations, with array.shape=(5348,1225)

>> TIME COMPARISON
>> 100 iterations | array.shape=(5348, 1225)

>> Option1: 0.4838 seconds
>> Option2: 0.3404 seconds
>> Option3: 0.3355 seconds
>> Option4: 0.2331 seconds
>> Option5: 0.3145 seconds
>> Option6: 0.2114 seconds

It can also be extend to generic axis and generic n-dim array, as proposed by Divakar. However, it is still too time consuming for what I'm trying to achieve (since I'll have to run this function millions of times within a loop). Is there a faster alternative? Or have we reached what's feasible with Python?

Mike
  • 102
  • 6

1 Answers1

1

Method #1

Here's one way -

def rank_with_nans(a, order='descending'):
    na = np.isnan(a)
    sm = na.sum(1,keepdims=True)
    
    if order=='descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf,a)
    
    out = b.argsort(1,'stable').argsort(1)+1. - sm
    out[out<=0] = np.nan
    return out

We can optimize on the double argsort part with a variation based on this post, shown below -

def rank_with_nans_v2(a, order='descending'):
    na = np.isnan(a)
    sm = na.sum(1,keepdims=True)
    
    if order=='descending':
        b = np.where(np.isnan(a), -np.inf, -a)
    else:
        b = np.where(np.isnan(a), -np.inf,a)

    idx = b.argsort(1,'stable')
    m,n = idx.shape
    sidx = np.empty((m,n), dtype=float)
    np.put_along_axis(sidx, idx,np.arange(1,n+1), axis=1)
    
    out = sidx - sm
    out[out<=0] = np.nan
    return out

Sample runs -

In [338]: a
Out[338]: 
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])

In [339]: rank_with_nans(a, order='descending')
Out[339]: 
array([[ 2.,  1.,  3.,  4.],
       [ 3., nan,  1.,  2.],
       [nan,  1.,  2.,  3.]])

In [340]: rank_with_nans(a, order='ascending')
Out[340]: 
array([[ 2.,  4.,  3.,  1.],
       [ 1., nan,  3.,  2.],
       [nan,  3.,  1.,  2.]])

Method #2

Without inf conversion, here's with double-argsort -

def rank_with_nans_v3(a, order='descending'):
    b = -a if order=='descending' else a        
    out = b.argsort(1,'stable').argsort(1)+1.
    return np.where(np.isnan(a), np.nan, out)

Again, with the argsort-skip trick -

def rank_with_nans_v4(a, order='descending'):
    b = -a if order=='descending' else a        
    idx = b.argsort(1,'stable')
    m,n = idx.shape
    out = np.empty((m,n), dtype=float)
    np.put_along_axis(out, idx,np.arange(1,n+1), axis=1)
    return np.where(np.isnan(a), np.nan, out)

Bonus : Extend to generic axis and generic n-dim array

We can extend the proposed solutions to incorporate axis so that the ranking could be applied along that axis. The last solution v4 seems would be the most efficient one. Let's use it to make it generic -

def rank_with_nans_along_axis(a, order='descending', axis=-1):
    b = -a if order=='descending' else a
    idx = b.argsort(axis=axis, kind='stable')
    out = np.empty(idx.shape, dtype=float)
    indexer = tuple([None if i!=axis else Ellipsis for i in range(a.ndim)])
    np.put_along_axis(out, idx, np.arange(1,a.shape[axis]+1, dtype=float)[indexer], axis=axis)
    return np.where(np.isnan(a), np.nan, out)

Sample run -

In [227]: a
Out[227]: 
array([[4.32, 6.43, 4.32, 2.21],
       [0.65,  nan, 8.12, 6.43],
       [ nan, 4.32, 1.23, 1.23]])

In [228]: rank_with_nans_along_axis(a, order='descending',axis=0)
Out[228]: 
array([[ 1.,  1.,  2.,  2.],
       [ 2., nan,  1.,  1.],
       [nan,  2.,  3.,  3.]])

In [229]: rank_with_nans_along_axis(a, order='ascending',axis=0)
Out[229]: 
array([[ 2.,  2.,  2.,  2.],
       [ 1., nan,  3.,  3.],
       [nan,  1.,  1.,  1.]])
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • this approach is excellent (~40% more efficient than my initial approach) - see the update on my original post. However, it is still too time consuming. Do you think it's possible to improve it further or is this as far as we can go with Python? – Mike Sep 28 '20 at 16:03