I am trying to find the most performant way to get the unique values from a NumPy array. NumPy's unique
function is relatively slow because it sorts the values before finding the unique ones. Pandas hashes the values with the klib C library, which is much faster. I am looking for a Cython solution.
The simplest solution seems to be to iterate through the array and add each element to a Python set, like this:
import numpy as np
cimport numpy as np
cimport cython
from numpy cimport ndarray
from cpython cimport set

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s
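For reference, a minimal setup.py along these lines is enough to build everything here (a sketch only; the module and file names are placeholders, and the C++ versions below need language="c++"):
# hypothetical setup.py -- file/module names are placeholders
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "unique_funcs",
        ["unique_funcs.pyx"],
        include_dirs=[np.get_include()],  # for "cimport numpy"
        language="c++",                   # required for libcpp.unordered_set
    )
]

setup(ext_modules=cythonize(extensions, language_level=3))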
I also tried a C++ unordered_set:
from libcpp.unordered_set cimport unordered_set

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[long long] s  # long long matches int64; plain int would truncate
    for i in range(n):
        s.insert(a[i])
    return s
Performance
# create array of 1,000,000
a = np.random.randint(0, 50, 1000000)
# Pure Python
%timeit set(a)
86.4 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Convert to list first
a_list = a.tolist()
%timeit set(a_list)
10.2 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# NumPy
%timeit np.unique(a)
32 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Pandas
%timeit pd.unique(a)
5.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Cython
%timeit unique_cython_int(a)
13.4 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Cython - C++ unordered_set
%timeit unique_cpp_int(a)
17.8 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Discussion
So pandas is about 2.5x faster than a cythonized set, and its lead increases when there are more distinct elements. Surprisingly, a pure Python set (built from a list) beats a cythonized set.
My question here: is there a faster way to do this in Cython than just calling the add method repeatedly? And can the C++ unordered_set be improved?
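One thing I've considered for the unordered_set version (a sketch only; unique_cpp_int_reserved is just my name for it, and I haven't benchmarked it): reserving buckets up front so the table never rehashes during insertion.
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int_reserved(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[long long] s
    s.reserve(n)  # pre-allocate buckets so inserts never trigger a rehash
    for i in range(n):
        s.insert(a[i])
    return s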
Using Unicode strings
The story changes when we use Unicode strings. I believe I have to convert the NumPy array to an object
data type to properly type it for Cython.
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_str(ndarray[object] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s
And again I tried a C++ unordered_set:
from libcpp.string cimport string

@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_str(ndarray[object] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[string] s  # bytes objects coerce to std::string
    for i in range(n):
        s.insert(a[i])
    return s
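Note that the C++ version only sees bytes, so getting str values back out needs a decode pass on the returned set, something like the following (s_bytes_obj is the bytes-object array built in the setup below):
# the unordered_set comes back as a Python set of bytes objects
unique_bytes = unique_cpp_str(s_bytes_obj)
unique_strs = {b.decode('utf-8') for b in unique_bytes}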
Performance
Create an array of 1 million strings with 1,000 distinct values
s_1000 = []
for i in range(1000):
    s = np.random.choice(list('abcdef'), np.random.randint(5, 50))
    s_1000.append(''.join(s))
s_all = np.random.choice(s_1000, 1000000)
# s_all has NumPy unicode as its data type. Must convert to object
s_unicode_obj = s_all.astype('O')
# C++ does not easily handle unicode. Convert to bytes and then to object
s_bytes_obj = s_all.astype('S').astype('O')
# Pure Python
%timeit set(s_all)
451 ms ± 5.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set(s_unicode_obj)
71.9 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# using set on a list
s_list = s_all.tolist()
%timeit set(s_list)
63.1 ms ± 7.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# NumPy
%timeit np.unique(s_unicode_obj)
1.69 s ± 97.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(s_all)
633 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Pandas
%timeit pd.unique(s_unicode_obj)
97.6 ms ± 6.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Cython
%timeit unique_cython_str(s_unicode_obj)
60 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Cython - C++ unordered_set
%timeit unique_cpp_str(s_bytes_obj)
247 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Discussion
So it appears that Python's set outperforms pandas for Unicode strings, but not for integers. And again, iterating through the array in Cython doesn't really help us at all.
Cheating with integers
It's possible to circumvent sets entirely if you know the range of your integers isn't too crazy. You can simply create a second boolean array initialized to all False, flip a value's position to True the first time you encounter it, and append that value to a list. This is extremely fast since no hashing is done.
The following works for non-negative integer arrays. If you had negative integers, you would have to add a constant to shift the numbers up to 0 (a signed variant is sketched after the timing below).
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_bounded(ndarray[np.int64_t] a):
    cdef int i, n = len(a)
    # size the flag array by the largest value, not len(a), so values
    # >= len(a) can't index past the end with bounds checking off
    cdef ndarray[np.uint8_t, cast=True] seen = np.zeros(a.max() + 1, dtype=bool)
    cdef list result = []
    for i in range(n):
        if not seen[a[i]]:
            seen[a[i]] = True
            result.append(a[i])
    return result
%timeit unique_bounded(a)
1.18 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
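For the negative-integer case mentioned above, a shift by the minimum looks like this (a sketch only; unique_bounded_signed is my name for it, and I haven't timed it):
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_bounded_signed(ndarray[np.int64_t] a):
    cdef int i, n = len(a)
    cdef long long lo = a.min(), hi = a.max()
    # shift by the minimum so indices start at 0
    cdef ndarray[np.uint8_t, cast=True] seen = np.zeros(hi - lo + 1, dtype=bool)
    cdef list result = []
    for i in range(n):
        if not seen[a[i] - lo]:
            seen[a[i] - lo] = True
            result.append(a[i])
    return result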
The downside is of course memory usage, since your largest integer could force an extremely large array. But the same trick could work for floats too if you knew precisely how many decimal places each number had.
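For example, under the assumptions (mine, for this sketch) that the floats are non-negative and have at most a known number of decimal places, scaling to exact int64 keys works:
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_bounded_float(ndarray[np.float64_t] a, int decimals):
    # sketch: assumes non-negative values with at most `decimals`
    # decimal places, so scaling yields exact, bounded integer keys
    cdef int i, n = len(a)
    cdef double scale = 10.0 ** decimals
    cdef double amax = a.max()
    cdef long long key, max_key = <long long>(amax * scale + 0.5)
    cdef ndarray[np.uint8_t, cast=True] seen = np.zeros(max_key + 1, dtype=bool)
    cdef list result = []
    for i in range(n):
        key = <long long>(a[i] * scale + 0.5)  # round to nearest integer key
        if not seen[key]:
            seen[key] = True
            result.append(a[i])
    return result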
Summary
Integers: 50 unique values out of 1,000,000 total
- Pandas - 5 ms
- Python set of list - 10 ms
- Cython set - 13 ms
- 'Cheating' with integers - 1.2 ms
Strings: 1,000 unique values out of 1,000,000 total
- Cython set - 60 ms
- Python set of list - 63 ms
- Pandas - 98 ms
I'd appreciate any help making these faster.