numpy cumulative count in linear time

Question

I have data like this: [144 144 144 144 143 143 143 93 93 93 93 93 93 93 93 93 93], and I want to make data like this: [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, ....]

I tried to use the function in the link below, I got this: [3 2 1 0 2 1 0 9 0 6 5 4 3 2 1 7 8]

How can I fix it?

def grp_range(a):
    count = np.unique(a,return_counts=1)[1]
    idx = count.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -count[:-1]+1
    out = id_arr.cumsum()[np.argsort(a).argsort()]

    return out

How to use numpy to get the cumulative count by unique values in linear time?

score 2 · Answer 1 · answered Jul 15 '22 at 11:55

To speed up bpfrd version you can use numba. On my machine ~40 times faster as bpfrd version and ~10 times faster to ouroboros1 versions!

from numba import jit
import numpy as np

@jit(nopython=True)
def numba_style(a):
    prev = a[0]
    idx = 0
    c = [idx]

    for i in range(1, len(a)):
        if a[i] == prev:
            idx += 1
        else:
            prev = a[i]
            idx = 0
        c.append(idx) 

    return np.array(c)

Function	Timing (mean ± std. dev. of 3 runs, 3 loops each)
list_style	818 µs ± 116 µs per loop
grp_range	230 µs ± 81.3 µs per loop
cumcount	170 µs ± 73.9 µs per loop
np_style_2	165 µs ± 86.2 µs per loop
np_style	118 µs ± 79.3 µs per loop
numba_style	19.2 µs ± 887 ns per loop

Performance Testing

Speed comparison also to ouroboros1 versions. Define functions:

def list_style(a):
    prev = a[0]
    idx = 0
    c = [idx]

    for i in range(1, len(a)):
        if a[i] == prev:
            idx += 1
        else:
            prev = a[i]
            idx = 0
        c.append(idx) 

    return c

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def grp_range(a):
    count = np.unique(a,return_counts=1)[1]
    idx = count.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -count[:-1]+1
    out = id_arr.cumsum()[np.argsort(a, kind='mergesort').argsort()] # adjustment here
    return out

def np_style(a):
    unique, index, counts = np.unique(a, return_counts=True, return_index=True)
    arg_s = np.argsort(index)
    return np.concatenate(list(map(np.arange,counts[arg_s])), axis=0)

def np_style_2(a):
    aa = a+np.arange(len(a))
    unique, index = np.unique(a, return_index=True)
    for uni, ind in zip(unique, index):
        aa[a==uni] -= aa[ind]
    return aa

And the actual testing with a slightly longer array:

n =   4,   3,  9, 15, 24, 100, 2500, 555
t = 144, 143, 93, 85, 84, 100, 250, 555
a = np.concatenate([[t_i]*n_i for t_i, n_i in zip(t,n)])

%timeit -n 3 -r 3 grp_range(a)
%timeit -n 3 -r 3 np_style(a)
%timeit -n 3 -r 3 np_style_2(a)
%timeit -n 3 -r 3 cumcount(a)
%timeit -n 3 -r 3 list_style(a)
%timeit -n 3 -r 3 numba_style(a)

Nice! Should be pointed out, however, that this solution only works for a list/array where each number has its own sequence (e.g. `[144, 143, 93]`). So, it will work for the example supplied by the OP, but if the sequences are mixed (e.g. `[144, 143, 93, 144, 93]`), this method will get you a very fast, but wrong, result. — ouroboros1, Jul 15 '22 at 14:55

score 1 · Answer 2 · answered Jul 15 '22 at 07:11

try this:

a = [144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93]
prev = a[0]
idx = 0
c = [idx]

for i in range(1, len(a)):
    if a[i] == prev:
        idx += 1
    else:
        prev = a[i]
        idx = 0
    c.append(idx) 

c
>[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

ouroboros1 · Answer 3 · 2022-07-15T11:31:37.823

TL;DR: within your function change

out = id_arr.cumsum()[np.argsort(a).argsort()]

into

out = id_arr.cumsum()[np.argsort(a, kind='mergesort').argsort()]

If speed is a concern, use the solution offered by @piRSquared in the post mentioned. You'll need the first three functions mentioned. So:

import numpy as np

arr = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93])

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

cumcount(arr) # will get you desired output

The accepted answer from the referenced post is actually incorrect.

The problem lies with the fact that np.argsort uses quicksort as the default sorting algorithm. For a stable sort, we need mergesort (see the comments by @MartijnPieters on the matter here).

So, in your slightly adjusted function we need:

import numpy as np

def grp_range(a):
    count = np.unique(a,return_counts=1)[1]
    idx = count.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -count[:-1]+1
    out = id_arr.cumsum()[np.argsort(a, kind='mergesort').argsort()] # adjustment here
    return out

Testing a couple of examples:

# OP's example
arr = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93])

arr_result = grp_range(arr)
print(arr_result)
# [0 1 2 3 0 1 2 0 1 2 3 4 5 6 7 8 9] (correct)

# OP's example, with mixed sequence (note addition 143, 144 at arr1[9:11])
arr_alt = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 143, 144, 93, 93, 93, 93, 93, 93])

arr_alt_result = grp_range(arr_alt)
print(arr_alt_result)
# [0 1 2 3 0 1 2 0 1 3 4 2 3 4 5 6 7] (correct) (note: arr_alt_result[9:11] == array([3, 4], dtype=int32))

As mentioned above, the solution offered by @piRSquared will be faster than this with the same results.

A final aside. The sequence posted is in descending order. If this is true for the actual data you are working with, you could do something like this:

import numpy as np

arr = np.array([144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93])

count = np.unique(arr,return_counts=1)[1][::-1] # from ascending to descending
out = np.concatenate(list(map(np.arange,count)), axis=0)

# out: [0 1 2 3 0 1 2 0 1 2 3 4 5 6 7 8 9]

or this:

from collections import Counter

arr = [144, 144, 144, 144, 143, 143, 143, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93]
count_dict = Counter(arr)

out = list()
for v in count_dict.items():
    out.extend([*range(v[1])])

or indeed, use the answer provided by @bpfrd.

numpy cumulative count in linear time

3 Answers3

Performance Testing