I'm trying to obtain the ranks in a 2D array, along axis=1, with no repeated ranks.
Suppose I have the array below:
array([[4.32, 6.43, 4.32, 2.21],
[0.65, nan, 8.12, 6.43],
[ nan, 4.32, 1.23, 1.23]])
I would expect the following result, for a 'hi-lo' rank:
array([[ 2., 1., 3., 4.],
[ 3., nan, 1., 2.],
[nan, 1., 2., 3.]])
And the following result, for a 'lo-hi' rank:
array([[ 2., 4., 3., 1.],
[ 1., nan, 3., 2.],
[nan, 3., 1., 2.]])
I've been using scipy.stats.rankdata
but this solution is very time consuming (for large arrays). Also, the code I'm using (shown below) relies on np.apply_along_axis
, which I know is not very efficient. I know that scipy.stats.rankdata
accepts an axis argument but the code behind it uses exactly np.apply_along_axis
(See here).
def f(array, order='hi-lo'):
array = np.asarray(array)
lo_hi_rank = np.apply_along_axis(rankdata, 1, array, 'ordinal')
lo_hi_rank = lo_hi_rank.astype(float)
lo_hi_rank[np.isnan(array)] = np.NaN
if order == 'lo-hi':
return lo_hi_rank
else:
return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1
Does anyone know a faster implementation?
Update
I've compared the execution time of all the options suggested so far.
Option 1 below is the explicit loop version of the code I suggested above (repeated below as Option 2)
def option1(a, order='ascending'):
ranks = np.empty_like(a)
for row in range(ranks.shape[0]):
lo_hi_rank = rankdata(a[row], method='ordinal')
lo_hi_rank = lo_hi_rank.astype(float)
lo_hi_rank[np.isnan(a[row])] = np.NaN
if order == 'ascending':
ranks[row] = lo_hi_rank.copy()
else:
ranks[row] = np.nanmax(lo_hi_rank) - lo_hi_rank + 1
return ranks
def option2(a, order='ascending'):
a = np.asarray(a)
lo_hi_rank = np.apply_along_axis(rankdata, 1, a, 'ordinal')
lo_hi_rank = lo_hi_rank.astype(float)
lo_hi_rank[np.isnan(a)] = np.NaN
if order == 'ascending':
return lo_hi_rank
else:
return np.nanmax(lo_hi_rank, axis=1, keepdims=True) - lo_hi_rank + 1
Options 3-6 were suggested by Divakar:
def option3(a, order='ascending'):
na = np.isnan(a)
sm = na.sum(1,keepdims=True)
if order=='descending':
b = np.where(np.isnan(a), -np.inf, -a)
else:
b = np.where(np.isnan(a), -np.inf,a)
out = b.argsort(1,'stable').argsort(1)+1. - sm
out[out<=0] = np.nan
return out
def option4(a, order='ascending'):
na = np.isnan(a)
sm = na.sum(1,keepdims=True)
if order=='descending':
b = np.where(np.isnan(a), -np.inf, -a)
else:
b = np.where(np.isnan(a), -np.inf,a)
idx = b.argsort(1,'stable')
m,n = idx.shape
sidx = np.empty((m,n), dtype=float)
np.put_along_axis(sidx, idx,np.arange(1,n+1), axis=1)
out = sidx - sm
out[out<=0] = np.nan
return out
def option5(a, order='descending'):
b = -a if order=='descending' else a
out = b.argsort(1,'stable').argsort(1)+1.
return np.where(np.isnan(a), np.nan, out)
def option6(a, order='descending'):
b = -a if order=='descending' else a
idx = b.argsort(1,'stable')
m,n = idx.shape
out = np.empty((m,n), dtype=float)
np.put_along_axis(out, idx,np.arange(1,n+1), axis=1)
return np.where(np.isnan(a), np.nan, out)
Option 6 seems to be the cleanest and is indeed the fastest (~40% improvement vs Option 2). See below the average execution time for 100 iterations, with array.shape=(5348,1225)
>> TIME COMPARISON
>> 100 iterations | array.shape=(5348, 1225)
>> Option1: 0.4838 seconds
>> Option2: 0.3404 seconds
>> Option3: 0.3355 seconds
>> Option4: 0.2331 seconds
>> Option5: 0.3145 seconds
>> Option6: 0.2114 seconds
It can also be extend to generic axis and generic n-dim array, as proposed by Divakar. However, it is still too time consuming for what I'm trying to achieve (since I'll have to run this function millions of times within a loop). Is there a faster alternative? Or have we reached what's feasible with Python?