Proposed approach
Let's bring some NumPy magic to the table! We will exploit `np.maximum.accumulate`.
Explanation
To see how `maximum.accumulate` could help us, let's assume we have the groups lined up sequentially.
Let's consider a sample groupby column:

groupby column : [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]

and a sample value column:

value column : [3, 1, 4, 1, 3, 3, 1, 5, 2, 4]
Using `maximum.accumulate` directly on `value` won't give us the desired output, as we need to do these accumulations only within each group. To do so, one trick would be to offset each group from the group before it.
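To see the problem concretely, here is a plain accumulate over the sample value column (a minimal sketch using the sample data from above); it ignores the group boundaries entirely:

```python
import numpy as np

value = np.array([3, 1, 4, 1, 3, 3, 1, 5, 2, 4])

# Plain running max: the 4 from the first group leaks into the
# second group, which should restart its cumulative max at 1.
print(np.maximum.accumulate(value))  # [3 3 4 4 4 4 4 5 5 5]
```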
There could be a few methods to do that offsetting work. One easy way would be to offset each group by `max of value + 1` more than the previous one. For the sample, that offset would be 6. So, for the second group we would add 6, for the third 12, and so on. Thus, the modified `value` would be -
value column : [3, 1, 4, 7, 9, 15, 13, 17, 14, 16]
Now, we can use `maximum.accumulate`, and the accumulations would be restricted within each group -
value cummaxed: [3, 3, 4, 7, 9, 15, 15, 17, 17, 17]
To go back to the original values, subtract the offsets that were added before.
value cummaxed: [3, 3, 4, 1, 3, 3, 3, 5, 5, 5]
That's our desired result!
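The whole shift-accumulate-unshift trick can be sketched on the sample arrays from above (assuming, as stated, that the groups are already contiguous):

```python
import numpy as np

groupby = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
value   = np.array([3, 1, 4, 1, 3, 3, 1, 5, 2, 4])

offset = value.max() + 1          # 6: larger than any value
shifts = groupby * offset         # per-element shift: [0 0 0 6 6 12 ...]
shifted = value + shifts          # [3 1 4 7 9 15 13 17 14 16]

# Accumulate on the shifted values, then undo the shifts
cummaxed = np.maximum.accumulate(shifted) - shifts
print(cummaxed)                   # [3 3 4 1 3 3 3 5 5 5]
```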
At the start, we assumed the groups to be sequential. To get the data into that format, we will use `np.argsort(groupby, kind='mergesort')` to get the sorted indices, such that equal group labels keep their original order, and then use these indices to index into the `groupby` column.
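As a small illustration (with made-up arrays), a stable mergesort argsort lines the groups up contiguously while preserving the original order within each group:

```python
import numpy as np

g = np.array([2, 0, 2, 1, 0, 2])
v = np.array([5, 3, 1, 4, 1, 2])

# 'mergesort' is stable: equal group labels keep their original order
sidx = np.argsort(g, kind='mergesort')
print(g[sidx])  # [0 0 1 2 2 2] -> groups are now contiguous
print(v[sidx])  # [3 1 4 5 1 2] -> within-group order is preserved
```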
To account for negative elements in `value`, we just need to offset by `max() - min() + 1` rather than just `max() + 1`.
Thus, the implementation would look something like this -
import numpy as np

def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881
    # Returns the inverse of the permutation idx.
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def numpy_cummmax(groupby, value, factorize_groupby=0):
    # Inputs : 1D arrays.
    # Get sorted indices keeping the order. Sort groupby and value cols.
    sidx = np.argsort(groupby, kind='mergesort')
    sorted_groupby, sorted_value = groupby[sidx], value[sidx]

    if factorize_groupby == 1:
        # Replace arbitrary group labels with consecutive ints 0, 1, 2, ...
        sf = np.concatenate(([0], np.flatnonzero(sorted_groupby[1:] !=
                             sorted_groupby[:-1]) + 1, [sorted_groupby.size]))
        sorted_groupby = np.repeat(np.arange(sf.size - 1), sf[1:] - sf[:-1])

    # Get shifts to be used for shifting each group. The per-group spacing
    # must exceed the spread of value, so the accumulation can't leak
    # across group boundaries.
    mx = sorted_value.max() - sorted_value.min() + 1
    shifts = sorted_groupby * mx

    # Shift, take maximum.accumulate along the value col, then undo the
    # shifts. The shifts keep the cumulative max confined within each group.
    group_cummaxed = np.maximum.accumulate(shifts + sorted_value) - shifts
    return group_cummaxed[argsort_unique(sidx)]
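As a quick sanity check of the helper (a standalone sketch, restating `argsort_unique` so it runs on its own): it computes the inverse of a permutation, which is what lets us map the sorted results back to the original row order:

```python
import numpy as np

def argsort_unique(idx):
    # Inverse of a permutation: sidx[idx[i]] = i
    sidx = np.empty(idx.size, dtype=int)
    sidx[idx] = np.arange(idx.size)
    return sidx

idx = np.array([2, 0, 3, 1])
inv = argsort_unique(idx)
print(inv)                             # [1 3 0 2]

data = np.array([10, 20, 30, 40])
assert (data[idx][inv] == data).all()  # permuting then un-permuting round-trips
```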
Runtime test and verification
Verification
1) Groupby as ints :
In [58]: # Setup with groupby as ints
...: LENGTH = 1000
...: g = np.random.randint(low=0, high=LENGTH/2, size=LENGTH)
...: v = np.random.rand(LENGTH)
...:
In [59]: df = pd.DataFrame(np.column_stack((g,v)),columns=['group', 'value'])
In [60]: # Verify results
...: out1 = df.groupby('group').cummax()
...: out2 = numpy_cummmax(df['group'].values, df['value'].values)
...: print np.allclose(out1.values.ravel(), out2, atol=1e-5)
...:
True
2) Groupby as floats :
In [10]: # Setup with groupby as floats
...: LENGTH = 100000
...: df = pd.DataFrame(np.random.randint(0,LENGTH//2,(LENGTH,2))/10.0, \
...: columns=['group', 'value'])
In [18]: # Verify results
...: out1 = df.groupby('group').cummax()
...: out2 = numpy_cummmax(df['group'].values, df['value'].values, factorize_groupby=1)
...: print np.allclose(out1.values.ravel(), out2, atol=1e-5)
...:
True
Timings -
1) Groupby as ints (same as the setup used for timings in the question) :
In [24]: LENGTH = 100000
    ...: g = np.random.randint(0,LENGTH//2,(LENGTH))
...: v = np.random.rand(LENGTH)
...:
In [25]: %timeit numpy(g, v) # Best solution from posted question
1 loops, best of 3: 373 ms per loop
In [26]: %timeit pir1(g, v) # @piRSquared's solution-1
1 loops, best of 3: 165 ms per loop
In [27]: %timeit pir2(g, v) # @piRSquared's solution-2
1 loops, best of 3: 157 ms per loop
In [28]: %timeit numpy_cummmax(g, v)
100 loops, best of 3: 18.3 ms per loop
2) Groupby as floats :
In [29]: LENGTH = 100000
...: g = np.random.randint(0,LENGTH//2,(LENGTH))/10.0
...: v = np.random.rand(LENGTH)
...:
In [30]: %timeit pir1(g, v) # @piRSquared's solution-1
1 loops, best of 3: 157 ms per loop
In [31]: %timeit pir2(g, v) # @piRSquared's solution-2
1 loops, best of 3: 156 ms per loop
In [32]: %timeit numpy_cummmax(g, v, factorize_groupby=1)
10 loops, best of 3: 20.8 ms per loop
In [34]: np.allclose(pir1(g, v),numpy_cummmax(g, v, factorize_groupby=1),atol=1e-5)
Out[34]: True