90

Suppose I have an array

a = np.array([1, 2, 1, 3, 3, 3, 0])

How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]) or possibly array([1, 3]) if efficient.

I've come up with a few methods that appear to work:

Masking

m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]

Set operations

a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]

This one is cute but probably illegal (as a isn't actually unique):

np.setxor1d(a, np.unique(a), assume_unique=True)

Histograms

u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]

Sorting

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

Pandas

s = pd.Series(a)
s[s.duplicated()]

Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).


Conclusions

Testing with a 10-million-element data set (on a 2.8 GHz Xeon):

a = np.random.randint(10**7, size=10**7)

The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.

I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.

Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.
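
For reference, a minimal sketch of the kind of timing harness behind those numbers (the exact benchmark code isn't shown here, so the function names and setup below are my own):

import timeit

import numpy as np

# Illustrative harness only; timings will differ by machine and numpy version.
a = np.random.randint(10**7, size=10**7)

def dup_sort(a):
    # "Sorting" method from the question
    s = np.sort(a, axis=None)
    return s[:-1][s[1:] == s[:-1]]

def dup_bincount(a):
    # "Histograms" method from the question
    u, i = np.unique(a, return_inverse=True)
    return u[np.bincount(i) > 1]

for name, func in [("sorting", dup_sort), ("bincount", dup_bincount)]:
    t = timeit.timeit(lambda: func(a), number=3) / 3
    print(f"{name}: {t:.2f}s")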

ecatmur
  • Could you explain why the sorting solution works? I tried it out but for some reason I don't really get it. – Markus Jul 18 '13 at 16:06
  • @Markus if you sort an array, any duplicate values are adjacent. You then use a boolean mask to take only those items that are equal to the previous item; there's a worked example just after these comments. – ecatmur Jul 18 '13 at 16:53
  • Shouldn't it be `s[:-1][ s[1:] == s[:-1] ]`? I get an `IndexError` otherwise, the boolean mask being one element shorter than the `s`-array.... – snake_charmer Nov 13 '17 at 16:22
  • @snake_charmer I think earlier versions of numpy were more forgiving in this regard. I'll fix it, thanks. – ecatmur Nov 13 '17 at 17:43
  • pandas seems to have improved the performance of some underlying methods. On my machine, pandas is only 29% slower than the sorting method. The method proposed by Mad Physicist is 17% slower than sorting. – JE_Muc Oct 04 '19 at 11:14
  • FYI: I just tried the sorting solution on a normal list, which does not work. However with a sorted `numpy array` it does. – Godrebh Jun 15 '20 at 11:29
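
To flesh out the explanation in the comments, here is a small worked example of the sorting approach on the question's array (intermediate values shown as comments):

import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])
s = np.sort(a, axis=None)    # [0 1 1 2 3 3 3] -- duplicates are now adjacent
mask = s[1:] == s[:-1]       # [False  True False False  True  True]
print(s[:-1][mask])          # [1 3 3] -- one entry per repeated occurrence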

11 Answers

88

As of numpy version 1.9.0, np.unique has an argument return_counts which greatly simplifies your task:

u, c = np.unique(a, return_counts=True)
dup = u[c > 1]

This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.

It's probably worth mentioning that even though np.unique is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter solution. np.unique is sort-based, so runs asymptotically in O(n log n) time. Counter is hash-based, so has O(n) complexity. This will not matter much for anything but the largest datasets.
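
As a rough check on how the two compare in practice (a sketch only; the helper names are mine and results depend on dtype, size and machine):

import timeit
from collections import Counter

import numpy as np

a = np.random.randint(10**6, size=10**6)

def dup_unique(a):
    u, c = np.unique(a, return_counts=True)
    return u[c > 1]

def dup_counter(a):
    return [item for item, count in Counter(a).items() if count > 1]

print("np.unique:", timeit.timeit(lambda: dup_unique(a), number=10) / 10)
print("Counter:  ", timeit.timeit(lambda: dup_counter(a), number=10) / 10)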

Mad Physicist
  • Simple solution. Thanks. – dmrobotix Aug 26 '22 at 22:31
  • ...and on a very large dataset Counter's per-object space overhead can be prohibitive. – TLW May 03 '23 at 01:47
  • @TLW. Presumably the dataset is already in memory. Counter's hashtable is just an array with maybe some additional storage for buckets. Objects are stored by reference, not copy. – Mad Physicist May 03 '23 at 04:49
  • @MadPhysicist - You should look at the size of Python objects. The per-object overhead may surprise you. – TLW May 03 '23 at 05:11
  • numpy arrays don't have said overhead because they directly store the data - it's one of the reasons why numpy arrays can be quirky to work with, but also means that e.g. a billion-element uint8 array takes ~1GB instead of the ~8GB it would be if the elements were Python objects. – TLW May 03 '23 at 05:15
  • @TLW. Counter does not copy objects. It stores a hashmap of pointers. The overhead is for the hashmap, not for its contents. Assuming the objects are already in memory of course. And yes, the size of a pointer is generally > sizeof(uint8_t) – Mad Physicist May 03 '23 at 06:43
  • @MadPhysicist again, you should really look at what happens under the hood. Counter stores a hashmap of pointers _to Python objects_. Numpy arrays _do not store Python objects_ - when you add a value from a Numpy array to a Counter what happens is that the value is copied to a newly-allocated Python object, and the reference to said new object is passed to Counter. – TLW May 04 '23 at 03:13
  • All told a large Python 3 array of `int`s on my machine are ~30 to ~60 bytes per integer, depending on size (the array allocation can have extra padding). Beware if you are measuring this yourself that a) `sys.getsizeof` does not count the costs of pointed-to objects, just the object itself, b) the size of Python 3 `int`s vary with the number of bits stored, and c) Python `int`s are partially interned - doing e.g. `[0]*1_000_000` will give you a very misleading idea of Python object size, as it's really a million references to `0`. – TLW May 04 '23 at 03:21
  • In practice on my machine: an array of 100 million random uint64s in a numpy array results in a max RSS of ~810MB - 8B/element + a little overhead for Python in general, which makes sense. Passing that to a Counter results in a max RSS of ~11.4 GB. In other words, the Counter peaks at ~13x as much memory as the numpy array (!). Easy to test: `import numpy; import collections; x = numpy.random.randint(2**64-1, size=100000000, dtype=numpy.uint64); c = collections.Counter(x)` in a Python file, with or without the last statement commented out. – TLW May 04 '23 at 03:37
  • @TLW. Gotcha. Converting a packed numpy array is indeed going to be expensive. I completely forgot that this is exclusively about numpy and assumed the data would be in memory as python objects ahead of time. I'm familiar with the python integer format and interning. – Mad Physicist May 04 '23 at 03:49
38

I think this is most clearly done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.

>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]

Note: this is similar to Burhan Khalid's answer, but unpacking the (item, count) pairs from items() instead of subscripting them in the condition should be faster.

Steven Rumbalski
12

People have already suggested Counter variants, but here's one which doesn't use a listcomp:

>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> (Counter(a) - Counter(set(a))).keys()
[1, 3]

[Posted not because it's efficient -- it's not -- but because I think it's cute that you can subtract Counter instances.]

DSM
7

For Python 2.7+

>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[0] for x in Counter(n).most_common() if x[1] > 1]
[3, 1]
Burhan Khalid
6

Here's another approach using set operations that I think is a bit more straightforward than the ones you offer:

>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])

I suppose you're asking for numpy-only solutions, since if that's not the case, it's very difficult to argue with just using a Counter instead. I think you should make that requirement explicit though.

senderle
  • I see it as a wart on this approach that the `3` is repeated while the `1` is not. It would be nice to have it one way or the other. (This is not a criticism of your answer so much as of the original approach by the OP.) – Steven Rumbalski Jul 17 '12 at 18:36
  • @StevenRumbalski, yeah, I see what you mean. My sense is that the repeated `3` makes sense if what's really needed is a mask rather than a list of items; if what's needed is a list of items, then I agree that not having repeated items is better. – senderle Jul 17 '12 at 18:39
  • I'm not opposed to using `Counter`, but I am concerned about efficiency and compatibility. – ecatmur Jul 18 '12 at 07:42
5

If a is made up of small non-negative integers you can use numpy.bincount directly:

import numpy as np

a = np.array([3, 2, 2, 0, 4, 3])
counts = np.bincount(a)
print(np.where(counts > 1)[0])
# [2 3]

This is very similar to your "histogram" method, which is the one I would use if a was not made up of small integers.
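
For values that bincount can't handle directly (large, negative or non-integer), the question's "histogram" variant first maps them to dense indices via return_inverse; a quick sketch for reference (the example values are mine):

import numpy as np

a = np.array([10**9, 7, 7, -3, 10**9])
u, inv = np.unique(a, return_inverse=True)  # inv indexes into the sorted uniques
print(u[np.bincount(inv) > 1])              # 7 and 1000000000 are the duplicates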

Bi Rico
5

If the array is a sorted numpy array, then just do:

a = np.array([1, 2, 2, 3, 4, 5, 5, 6])
rep_el = a[np.diff(a) == 0]
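
If the array isn't sorted yet, the same diff trick applies after sorting it first (my extension for the question's unsorted array, not part of the answer above):

import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])
s = np.sort(a)                   # sort so that equal values become adjacent
print(s[1:][np.diff(s) == 0])    # [1 3 3]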
Thomas Antony
3

I'm adding my solution to the pile for this 3-year-old question because none of the existing solutions fit what I wanted, or they used libraries besides numpy. This method finds both the indices of duplicates and the values for each distinct set of duplicates.

import numpy as np

A = np.array([1,2,3,4,4,4,5,6,6,7,8])

# Record the indices where each unique element occurs.
list_of_dup_inds = [np.where(a == A)[0] for a in np.unique(A)]

# Filter out non-duplicates.
list_of_dup_inds = filter(lambda inds: len(inds) > 1, list_of_dup_inds)

for inds in list_of_dup_inds: print(inds, A[inds])
# >> [3 4 5] [4 4 4]
# >> [7 8] [6 6]
Matt Hancock
3

>>> import numpy as np

>>> a = np.array([1, 2, 2, 2, 2, 3])

>>> uniques, uniq_idx, counts = np.unique(a, return_index=True, return_counts=True)
>>> duplicates = a[uniq_idx[counts >= 2]]  # <--- Get duplicates

If you also want to get the orphans:

>>> orphans = a[uniq_idx[counts == 1]]
hydev
1

A combination of Pandas and NumPy, utilizing value_counts():

import pandas as pd
import numpy as np

arr = np.array(('a', 'b', 'b', 'c', 'a'))
pd.Series(arr).value_counts()

OUTPUT:

a    2
b    2
c    1
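
To reduce those counts to just the duplicated values, one possible follow-up (my addition, mirroring what the other answers return):

import numpy as np
import pandas as pd

arr = np.array(('a', 'b', 'b', 'c', 'a'))
counts = pd.Series(arr).value_counts()
print(counts[counts > 1].index.to_numpy())   # e.g. ['a' 'b']
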
1

Using the subtract method of a Counter avoids creating a second Counter like DSM's answer does. To keep only the positive counts (i.e. the duplicates), apply the unary + operator to the resulting Counter:

from collections import Counter

a = [1, 2, 1, 3, 3, 3, 0]

c = Counter(a)
c.subtract(set(a))   # subtract one count for each distinct value
dupes = (+c).keys()  # unary + drops zero and negative counts
# dupes -> dict_keys([1, 3])

I found this to be the best-performing solution in my testing.

Jab