27

I have many large (>35,000,000) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow. Can anyone else better the benchmark using Python and preferably NumPy?

def group():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32, size=35000000), dtype='u4')
    values.sort()
    groups = ((k, len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups, dtype='u4,u2')

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)

which returns:

$ python bench.py
111.377498865

Based on responses:

def group_original():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0, 1<<32, size=35000000), dtype='u4')
    values.sort()
    groups = ((k, len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups, dtype='u4,u2')

def group_gnibbler():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0, 1<<32, size=35000000), dtype='u4')
    values.sort()
    groups = ((k,sum(1 for i in g)) for k,g in groupby(values))
    index = np.fromiter(groups, dtype='u4,u2')

def group_christophe():
    import numpy as np
    values = np.array(np.random.randint(0, 1<<32, size=35000000), dtype='u4')
    values.sort()
    counts=values.searchsorted(values, side='right') - values.searchsorted(values, side='left')
    index = np.zeros(len(values), dtype='u4,u2')
    index['f0'] = values
    index['f1'] = counts
    # Erroneous result!

def group_paul():
    import numpy as np
    values = np.array(np.random.randint(0, 1<<32, size=35000000), dtype='u4')
    values.sort()
    diff = np.concatenate(([1], np.diff(values)))
    idx = np.concatenate((np.where(diff)[0], [len(values)]))
    index = np.empty(len(idx)-1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)

if __name__=='__main__':
    from timeit import Timer
    timings=[
                ("group_original", "Original"),
                ("group_gnibbler", "Gnibbler"),
                ("group_christophe", "Christophe"),
                ("group_paul", "Paul"),
            ]
    for method,title in timings:
        t = Timer("%s()"%method,"from __main__ import %s"%method)
        print "%s: %s secs"%(title, t.timeit(number=1))

which returns:

$ python bench.py
Original: 113.385262966 secs
Gnibbler: 71.7464978695 secs
Christophe: 27.1690568924 secs
Paul: 9.06268405914 secs

Although Christophe gives incorrect results currently.

Donny
  • Are there any restrictions to the range of possible integers? Can all 2^32 possible integers occur? – Sven Marnach Jan 10 '11 at 21:41
  • Do you need the output of `group()` sorted by key? – Sven Marnach Jan 10 '11 at 21:53
  • Hi Sven, there is an equal chance of each 2^32 integer occurring, and the grouped output (i.e. index) does need to be in ascending order. The values.sort() isn't really the bottleneck, it's the last line of group() which is the slow bit! Cheers! – Donny Jan 10 '11 at 22:22
  • 1
    If you just want the frequency count of integers, np.bincount does it. Note that np.bincount returns the count for every integer in range(max(values)), even for zero counts, which might not be what you want, but it's fast (see the sketch after these comments). – Josef Jan 10 '11 at 22:54
  • Hi user333700. I think the range of values between 0 and 2^32 means that bincount would use more memory than most computers possess! – Donny Jan 11 '11 at 01:19
  • Why would I get "OverflowError: Python int too large to convert to C long" when I run the code above? I am using Python from Anaconda. – Ken T May 15 '15 at 03:23
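
To make Josef's np.bincount suggestion (and Donny's memory objection) concrete, here is a minimal sketch on a deliberately small value range; the size and range below are made up purely for illustration, since for values up to 2^32 the counts array alone would need tens of gigabytes:

import numpy as np

# bincount approach on a SMALL range only; with values up to 2**32 the
# counts array would not fit in memory on most machines.
values = np.random.randint(0, 1 << 16, size=1000000)
counts = np.bincount(values)            # counts[i] == number of times i occurs
present = np.nonzero(counts)[0]         # drop integers that never occur
index = np.empty(len(present), dtype='u4,u2')
index['f0'] = present
index['f1'] = counts[present]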

10 Answers

32

I get a three times improvement doing something like this:

def group():
    import numpy as np
    values = np.array(np.random.randint(0, 3298, size=35000000), dtype='u4')
    values.sort()
    dif = np.ones(values.shape, values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif>0)
    vals = values[idx]
    count = np.diff(idx)
Paul
  • Thanks for the answer! Runs very fast, but len(vals)==len(count[0]) returns False at the end, which seems like an error? – Donny Jan 11 '11 at 01:18
  • Yes, it's a slight error. `count` is missing the final value. count should be: `np.diff(inx.tolist+[len(values)])` or something less ugly. – Paul Jan 11 '11 at 03:00
  • 2
    The error can be fixed by using the lines `idx = np.concatenate((np.where(dif)[0],[len(values)]))` and `vals = values[idx[:-1]]` in place of the 3rd- and 2nd-to-last lines respectively. This is really the best answer using just numpy. If you wanted it faster, I would recommend using Cython. It would be pretty easy to do and give a pretty significant improvement in speed and memory over this numpy code. – Justin Peel Jan 11 '11 at 06:37
  • @bpowah and @Justin I have incorporated your fixes into the original post, and this solution is 11x faster than the original code!!! – Donny Jan 11 '11 at 12:30
  • @Justin have you got a cython example you could share? – Donny Jan 11 '11 at 18:22
  • Extremely nice (runs in less than 5 seconds on my machine ;-) – ChristopheD Jan 11 '11 at 19:11
  • @Donny, I posted a Cython solution, but if I had thought about it more I would have realized that it wasn't very worthwhile. Sorting is taking most of the time here. – Justin Peel Jan 14 '11 at 13:47
  • any chance we could get an explanation on the concepts above? – WestCoastProjects May 18 '19 at 05:01
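
To expand on the last comment: on a sorted array, np.diff is non-zero exactly where a new value starts, so np.where of that difference array yields the starting index of every group, and differencing consecutive start indices gives the group sizes. A tiny, hand-checkable sketch (with the end-sentinel fix from the comments folded in):

import numpy as np

values = np.array([3, 3, 3, 7, 9, 9], dtype='u4')          # already sorted
diff = np.concatenate(([1], np.diff(values)))               # [1, 0, 0, 4, 2, 0]; non-zero at group starts
idx = np.concatenate((np.where(diff)[0], [len(values)]))    # [0, 3, 4, 6]: group starts plus an end sentinel
vals = values[idx[:-1]]                                     # [3, 7, 9]
count = np.diff(idx)                                        # [3, 1, 2]
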
14

More than 5 years have passed since Paul's answer was accepted. Interestingly, the sort() is still the bottleneck in the accepted solution.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def group_paul():
     5         1        99040  99040.0      2.4      import numpy as np
     6         1       305651 305651.0      7.4      values = np.array(np.random.randint(0, 2**32,size=35000000),dtype='u4')
     7         1      2928204 2928204.0    71.3      values.sort()
     8         1        78268  78268.0      1.9      diff = np.concatenate(([1],np.diff(values)))
     9         1       215774 215774.0      5.3      idx = np.concatenate((np.where(diff)[0],[len(values)]))
    10         1           95     95.0      0.0      index = np.empty(len(idx)-1,dtype='u4,u2')
    11         1       386673 386673.0      9.4      index['f0'] = values[idx[:-1]]
    12         1        91492  91492.0      2.2      index['f1'] = np.diff(idx)

The accepted solution runs in 4.0 s on my machine; with radix sort it drops down to 1.7 s.

Just by switching to radix sort, I get an overall 2.35x speedup. The radix sort is more than 4x faster than quicksort in this case.

See the question How to sort an array of integers faster than quicksort?, which was motivated by your question.


For the profiling I used line_profiler and kernprof (the @profile comes from there).
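
For anyone who wants to reproduce a table like the one above: with line_profiler, you decorate the function with @profile (kernprof injects that name at run time, so there is nothing to import) and run the script through kernprof. A sketch, where the file name is my own placeholder:

# bench_profile.py -- run with:  kernprof -l -v bench_profile.py
# kernprof injects the @profile decorator into builtins and prints the
# line-by-line timings when invoked with -v.
import numpy as np

@profile
def group_paul():
    values = np.array(np.random.randint(0, 2**32, size=35000000), dtype='u4')
    values.sort()
    diff = np.concatenate(([1], np.diff(values)))
    idx = np.concatenate((np.where(diff)[0], [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)

group_paul()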

Ali
  • Sounds very much like numpy needs a radix sort patch :-) – Donny Feb 11 '16 at 14:43
  • @Donny Yes. In fact, there is already an open issue: https://github.com/numpy/numpy/issues/6050 I will post my findings there, it is easy to grab this radix sort and put it into numpy. – Ali Feb 11 '16 at 14:52
  • 1
    What package are you using for the `@profile` decorator? Looks very nice. – danijar Sep 01 '16 at 01:52
  • 2
    @danijar Yeah, it's pretty cool. :) It is the [line_profiler and kernprof](https://github.com/rkern/line_profiler). – Ali Sep 01 '16 at 07:31
5

By request, here is a Cython version of this. I make two passes through the array. The first pass finds out how many unique elements there are, so that I can make my arrays for the unique values and counts the appropriate size.

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def dogroup():
    cdef unsigned long tot = 1
    cdef np.ndarray[np.uint32_t, ndim=1] values = np.array(np.random.randint(35000000,size=35000000),dtype=np.uint32)
    cdef unsigned long i, ind, lastval
    values.sort()
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            tot += 1
    cdef np.ndarray[np.uint32_t, ndim=1] vals = np.empty(tot,dtype=np.uint32)
    cdef np.ndarray[np.uint32_t, ndim=1] count = np.empty(tot,dtype=np.uint32)
    vals[0] = values[0]
    ind = 1
    lastval = 0
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            vals[ind] = values[i]
            count[ind-1] = i - lastval
            lastval = i
            ind += 1
    count[ind-1] = len(values) - lastval

The sorting is actually taking the most time here by far. Using the values array given in my code, the sorting takes 4.75 seconds and the actual finding of the unique values and counts takes 0.67 seconds. With the pure NumPy code from Paul's answer (using the same form of the values array, and with the fix I suggested in a comment), finding the unique values and counts takes 1.9 seconds (sorting still takes the same amount of time, of course).

It makes sense for most of the time to be taken up by the sorting because it is O(N log N) and the counting is O(N). You can speed up the sort a little bit over Numpy's (which uses C's qsort if I remember correctly), but you have to really know what you are doing and it probably isn't worthwhile. Also, there might be some way to speed up my Cython code a little bit more, but it probably isn't worthwhile.
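
If anyone wants to try the Cython code, one lightweight way to build such a .pyx file is pyximport. A sketch, assuming the code above is saved as group_cython.pyx (the filename is my own, hypothetical choice):

# build_and_run.py -- compile and import the Cython module on the fly.
import numpy as np
import pyximport

# NumPy headers are needed because the .pyx file does "cimport numpy".
pyximport.install(setup_args={'include_dirs': np.get_include()})

import group_cython   # hypothetical module name for the .pyx file above
group_cython.dogroup()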

Justin Peel
3

I guess the most obvious approach, not yet mentioned, is to simply use collections.Counter. Instead of building a huge number of temporary lists the way groupby does, it just counts the integers. It's a one-liner and gives a 2-fold speedup, but it is still slower than the pure numpy solutions.

def group():
    import sys
    import numpy as np
    from collections import Counter
    values = np.array(np.random.randint(0,sys.maxint,size=35000000),dtype='u4')
    c = Counter(values)

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)

I get a speedup from 136 s to 62 s on my machine, compared to the initial solution.

Michael
2

This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently-accepted solution:

def group_by_edge():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype= 'u4, u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)

This tested about half a second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what's happening here; the two-step diff approach is a bit opaque at first glance.
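
A quick way to sanity-check this variant on a small, hand-verifiable input; the reference comparison uses np.unique with return_counts, which only exists in newer NumPy (this check is my own addition, not part of the original answer):

import numpy as np

values = np.array([1, 1, 4, 4, 4, 9], dtype='u4')
values.sort()
edges = (values[1:] != values[:-1]).nonzero()[0] + 1   # first index of each new run
idx = np.concatenate(([0], edges, [len(values)]))      # [0, 2, 5, 6]
vals, counts = values[idx[:-1]], np.diff(idx)          # [1, 4, 9], [2, 3, 1]

ref_vals, ref_counts = np.unique(values, return_counts=True)
assert (vals == ref_vals).all() and (counts == ref_counts).all()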

senderle
2

This is a numpy solution:

def group():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')

    # we sort in place
    values.sort()

    # once sorted, the number of occurrences of a unique element is the
    # insertion index found when searching from the right minus the
    # insertion index found when searching from the left.
    #
    # np.dstack() is the numpy equivalent to Python's zip()

    l = np.dstack((values, values.searchsorted(values, side='right') - \
                   values.searchsorted(values, side='left')))

    index = np.fromiter(l, dtype='u4,u2')

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print t.timeit(number=1)

Runs in about 25 seconds on my machine compared to about 96 for your initial solution (which is a nice improvement).

There might still be room for improvement; I don't use numpy that often.

Edit: added some comments in code.

ChristopheD
  • Hi Christophe! I think the right left search subtract logic is fantastic! Have benchmarked in the original question and massaged the output into a structured array for memory conservation. – Donny Jan 11 '11 at 01:29
  • @Donny, I'm not sure that this is what you want. For instance, if `values` is equal to [2,2,3] then you will get `array([[[2,2],[2,2],[3,1]]])` for `l` and `array([(2L, 2)], dtype=[('f0',' – Justin Peel Jan 11 '11 at 06:33
  • @Justin you're correct, the results from this method are erroneous! However, if I do the searchsorted on the np.unique(values) then you get [1,1,1,1,1,1,.....] which is erroneous too! – Donny Jan 11 '11 at 11:37
  • @Donny, those might actually be the correct results ([1,1,1,1,1,1...]) because you are choosing 35000000 values from the range of 0 to 4294967296. Your range is more than 122 times larger than the number of values you are selecting. The chances aren't especially high that you'll get a repeated integer in that circumstance. – Justin Peel Jan 11 '11 at 16:09
  • @Justin, np.unique(values) will always return [1,1,1,1,1,....] because it only returns a single instance of each value. Please submit some code in a separate answer and we can try it out :) – Donny Jan 11 '11 at 16:40
  • This code should work correctly. The only problem is that it indeed does not remove duplicates. When you start with `[1,1,1,2,2,1]` you'll get `[(1, 4), (1, 4), (1, 4), (2, 2), (2, 2), (1, 4)]` which represents the correct frequency count but needs duplicates removed. – ChristopheD Jan 11 '11 at 18:30
  • @ChristopheD, you can fix it by doing something like `uv = np.unique(values)` then `l = np.dstack((uv, values.searchsorted(uv, side='right') - values.searchsorted(uv, side='left')))` (written out in full in the sketch after these comments). – Justin Peel Jan 14 '11 at 12:58
  • @ChristopheD @Donny, yes, doing searchsorted in np.unique(values) will only give [1,1,1,1,1...]. I misread before and thought that you were doing what I just said to do. – Justin Peel Jan 14 '11 at 13:00
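
For completeness, here is Justin Peel's fix from the comments above written out as a full function; this is a sketch based on those comments, not ChristopheD's original code:

def group_christophe_fixed():
    import numpy as np
    values = np.array(np.random.randint(0, 1 << 32, size=35000000), dtype='u4')
    values.sort()
    # searchsorted over the unique values: the right insertion point minus the
    # left insertion point of each unique value is its number of occurrences.
    uv = np.unique(values)
    counts = values.searchsorted(uv, side='right') - values.searchsorted(uv, side='left')
    index = np.empty(len(uv), dtype='u4,u2')
    index['f0'] = uv
    index['f1'] = counts
    return index
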
2

Replacing `len(list(g))` with `sum(1 for i in g)` gives a 2x speedup.

John La Rooy
2

Recent versions of NumPy (1.9 and later) can do this directly:

import numpy as np
unique_values, counts = np.unique(values, return_counts=True)
Gabriel
0

You could try the following (ab)use of scipy.sparse:

import numpy as np
from scipy import sparse

def sparse_bincount(values):
    # Build a single CSR row whose column indices are the values; summing
    # duplicates collapses repeated columns and accumulates counts in M.data.
    M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
    M.sum_duplicates()
    index = np.empty(len(M.indices),dtype='u4,u2')
    index['f0'] = M.indices
    index['f1']= M.data
    return index

This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integers as index types...
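
A quick usage check on a tiny array (my own addition, assuming the sparse_bincount function above is defined), just to show the output layout:

import numpy as np

small = np.array([5, 3, 5, 5, 9], dtype='u4')
result = sparse_bincount(small)
print(result['f0'])   # unique values: [3 5 9]
print(result['f1'])   # their counts:  [1 3 1]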

joeln
0

Sorting is Θ(N log N); I'd go for the amortized O(N) provided by Python's hash table implementation. Just use `defaultdict(int)` to keep counts of the integers and iterate over the array once:

import collections

counts = collections.defaultdict(int)
for v in values:
    counts[v] += 1

This is theoretically faster; unfortunately, I have no way to check it right now. Allocating the additional memory might make it actually slower than your solution, which is in-place.
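
One detail: the question needs the grouped output in ascending key order (see the comments on the question), so the dictionary still has to be sorted at the end. A minimal sketch of that final step; the small input size is only to keep the pure-Python loop quick:

import collections

import numpy as np

values = np.random.randint(0, 1 << 32, size=100000, dtype=np.uint32)

counts = collections.defaultdict(int)
for v in values:
    counts[v] += 1

# Sort the (value, count) pairs by value; result has shape (n_unique, 2).
index = np.array(sorted(counts.items()), dtype='u4')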

Edit: If you need to save memory, try radix sort, which is much faster on integers than quicksort (which I believe is what numpy uses).

Rafał Dowgird
  • Hi Rafal, thanks for the answer! Unfortunately dicts are much less space efficient than numpy arrays and with 35,000,000 values I quickly run out of memory on my 4GB laptop. Trying to avoid divide and conquer if possible and do the full batch in one go. – Donny Jan 10 '11 at 22:07
  • Hi Rafal, I think the sort speed isn't the bottleneck, more the counting method which can function with an input of 35,000,000 integers. – Donny Jan 11 '11 at 01:31
  • @Donny When you have a look at the options above, you see that sorting actually takes about five times as long as the efficient counting with diff. So sorting _is_ the bottleneck, and this radix sort could be a nice extension to the already suggested solutions. – Michael Feb 18 '13 at 04:03