9

Consider the following lists short_list and long_list

short_list = list('aaabaaacaaadaaac')
np.random.seed([3,1415])
long_list = pd.DataFrame(
    np.random.choice(list(ascii_letters),
                     (10000, 2))
).sum(1).tolist()

How do I calculate the cumulative count by unique value?

I want to use numpy and do it in linear time. I want this to compare timings with my other methods. It may be easiest to illustrate with my first proposed solution

def pir1(l):
    s = pd.Series(l)
    return s.groupby(s).cumcount().tolist()

print(np.array(short_list))
print(pir1(short_list))

['a' 'a' 'a' 'b' 'a' 'a' 'a' 'c' 'a' 'a' 'a' 'd' 'a' 'a' 'a' 'c']
[0, 1, 2, 0, 3, 4, 5, 0, 6, 7, 8, 0, 9, 10, 11, 1]

I've tortured myself trying to use np.unique because it returns a counts array, an inverse array, and an index array. I was sure I could these to get at a solution. The best I got is in pir4 below which scales in quadratic time. Also note that I don't care if counts start at 1 or zero as we can simply add or subtract 1.

Below are some of my attempts (none of which answer my question)

%%cython
from collections import defaultdict

def get_generator(l):
    counter = defaultdict(lambda: -1)
    for i in l:
        counter[i] += 1
        yield counter[i]

def pir2(l):
    return [i for i in get_generator(l)]

def pir3(l):
    return [i for i in get_generator(l)]

def pir4(l):
    unq, inv = np.unique(l, 0, 1, 0)
    a = np.arange(len(unq))
    matches = a[:, None] == inv
    return (matches * matches.cumsum(1)).sum(0).tolist()

enter image description here

enter image description here

Ébe Isaac
  • 11,563
  • 17
  • 64
  • 97
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • What's wrong with ```pir2``` - it makes a single pass over the list?? – wwii Nov 15 '16 at 05:10
  • @wwii I like `pir2`! It's the best I've found. I just imagined that a single pass using numpy slicing would be faster. I can't compare timings if I can't figure out a method. – piRSquared Nov 15 '16 at 05:11
  • 2
    Note that a solution that uses `numpy.unique` is O(n*log(n)), because the current implementation of `numpy.unique` sorts its argument. – Warren Weckesser Nov 15 '16 at 05:54
  • You might be able to shave a bit off ```pir2``` by ditching the generator function and list comprehension and just iterate over the list, add each item to the counter then append the counter value to a list which you return. – wwii Nov 15 '16 at 06:02
  • @WarrenWeckesser ty – piRSquared Nov 15 '16 at 06:20

3 Answers3

7

setup

short_list = np.array(list('aaabaaacaaadaaac'))

functions

  • dfill takes an array and returns the positions where the array changes and repeats that index position until the next change.

    # dfill
    # 
    # Example with short_list
    #
    #    0  0  0  3  4  4  4  7  8  8  8 11 12 12 12 15
    # [  a  a  a  b  a  a  a  c  a  a  a  d  a  a  a  c]
    #
    # Example with short_list after sorting
    #
    #    0  0  0  0  0  0  0  0  0  0  0  0 12 13 13 15
    # [  a  a  a  a  a  a  a  a  a  a  a  a  b  c  c  d]
    
  • argunsort returns the permutation necessary to undo a sort given the argsort array. The existence of this method became know to me via this post.. With this, I can get the argsort array and sort my array with it. Then I can undo the sort without the overhead of sorting again.
  • cumcount will take an array sort it, find the dfill array. An np.arange less dfill will give me cumulative count. Then I un-sort

    # cumcount
    # 
    # Example with short_list
    #
    # short_list:
    # [ a  a  a  b  a  a  a  c  a  a  a  d  a  a  a  c]
    # 
    # short_list.argsort():
    # [ 0  1  2  4  5  6  8  9 10 12 13 14  3  7 15 11]
    #
    # Example with short_list after sorting
    #
    # short_list[short_list.argsort()]:
    # [ a  a  a  a  a  a  a  a  a  a  a  a  b  c  c  d]
    # 
    # dfill(short_list[short_list.argsort()]):
    # [ 0  0  0  0  0  0  0  0  0  0  0  0 12 13 13 15]
    # 
    # np.range(short_list.size):
    # [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
    #
    # np.range(short_list.size) - 
    #     dfill(short_list[short_list.argsort()]):
    # [ 0  1  2  3  4  5  6  7  8  9 10 11  0  0  1  0]
    # 
    # unsorted:
    # [ 0  1  2  0  3  4  5  0  6  7  8  0  9 10 11  1]
    
  • foo function recommended by @hpaulj using defaultdict
  • div function recommended by @Divakar (old, I'm sure he'd update it)

code

def dfill(a):
    n = a.size
    b = np.concatenate([[0], np.where(a[:-1] != a[1:])[0] + 1, [n]])
    return np.arange(n)[b[:-1]].repeat(np.diff(b))

def argunsort(s):
    n = s.size
    u = np.empty(n, dtype=np.int64)
    u[s] = np.arange(n)
    return u

def cumcount(a):
    n = a.size
    s = a.argsort(kind='mergesort')
    i = argunsort(s)
    b = a[s]
    return (np.arange(n) - dfill(b))[i]

def foo(l):
    n = len(l)
    r = np.empty(n, dtype=np.int64)
    counter = defaultdict(int)
    for i in range(n):
        counter[l[i]] += 1
        r[i] = counter[l[i]]
    return r - 1

def div(l):
    a = np.unique(l, return_counts=1)[1]
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    rng = id_arr.cumsum()
    return rng[argunsort(np.argsort(l))]

demonstration

cumcount(short_list)

array([ 0,  1,  2,  0,  3,  4,  5,  0,  6,  7,  8,  0,  9, 10, 11,  1])

time testing

code

functions = pd.Index(['cumcount', 'foo', 'foo2', 'div'], name='function')
lengths = pd.RangeIndex(100, 1100, 100, 'array length')
results = pd.DataFrame(index=lengths, columns=functions)

from string import ascii_letters

for i in lengths:
    a = np.random.choice(list(ascii_letters), i)
    for j in functions:
        results.set_value(
            i, j,
            timeit(
                '{}(a)'.format(j),
                'from __main__ import a, {}'.format(j),
                number=1000
            )
        )

results.plot()

enter image description here

Community
  • 1
  • 1
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I need to research a bit more maybe on this and see if there's anymore improvement possible, but this looks good! – Divakar Jan 10 '17 at 17:34
6

Here's a vectorized approach using custom grouped range creating function and np.unique for getting the counts -

def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1],dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1]+1
    return id_arr.cumsum()

count = np.unique(A,return_counts=1)[1]
out = grp_range(count)[np.argsort(A).argsort()]

Sample run -

In [117]: A = list('aaabaaacaaadaaac')

In [118]: count = np.unique(A,return_counts=1)[1]
     ...: out = grp_range(count)[np.argsort(A).argsort()]
     ...: 

In [119]: out
Out[119]: array([ 0,  1,  2,  0,  3,  4,  5,  0,  6,  7,  8,  0,  9, 10, 11,  1])

For getting the count, few other alternatives could be proposed with focus on performance -

np.bincount(np.unique(A,return_inverse=1)[1])
np.bincount(np.fromstring('aaabaaacaaadaaac',dtype=np.uint8)-97)

Additionally, with A containing single-letter characters, we could get the count simply with -

np.bincount(np.array(A).view('uint8')-97)
Divakar
  • 218,885
  • 19
  • 262
  • 358
4

Besides defaultdict there are a couple of other counters. Testing a slightly simpler case:

In [298]: from collections import defaultdict
In [299]: from collections import defaultdict, Counter
In [300]: def foo(l):
     ...:     counter = defaultdict(int)
     ...:     for i in l:
     ...:         counter[i] += 1
     ...:     return counter
     ...: 
In [301]: short_list = list('aaabaaacaaadaaac')
In [302]: foo(short_list)
Out[302]: defaultdict(int, {'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [303]: Counter(short_list)
Out[303]: Counter({'a': 12, 'b': 1, 'c': 2, 'd': 1})
In [304]: arr=[ord(i)-ord('a') for i in short_list]
In [305]: np.bincount(arr)
Out[305]: array([12,  1,  2,  1], dtype=int32)

I constructed arr because bincount only works with ints.

In [306]: timeit np.bincount(arr)
The slowest run took 82.46 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.63 µs per loop
In [307]: timeit Counter(arr)
100000 loops, best of 3: 13.6 µs per loop
In [308]: timeit foo(arr)
100000 loops, best of 3: 6.49 µs per loop

I'm guessing it would hard to improve on pir2 based on default_dict.

Searching and counting like this are not a strong area for numpy.

hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • these all return the final counts. each of my functions return a list or array of equal length as the original where each element in the returned list is the count of the corresponding element so far in the source list. For straight counts, we numpy also has `np.unique(short_list, return_counts=True)[1]` – piRSquared Nov 15 '16 at 08:03
  • Yes, I chose not to take that extra step in the comparison. It was easier that way. – hpaulj Nov 15 '16 at 10:10
  • I've added an answer with some updated information if you're interested. – piRSquared Jan 09 '17 at 23:17