
I used:

df['ids'] = df['ids'].values.astype(set)

to turn each list into a set, but the output still contained plain numbers, not sets:

>>> x = np.array([[1, 2, 2.5],[12,35,12]])

>>> x.astype(set)
array([[1.0, 2.0, 2.5],
       [12.0, 35.0, 12.0]], dtype=object)
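
My reading of what happens (a small check, for context): `astype(set)` only changes the array's dtype to `object`; it never calls `set()` on anything, so each element is still a plain number:

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])
y = x.astype(set)

print(y.dtype)        # object
print(type(y[0, 0]))  # <class 'float'> -- the elements were never turned into sets
```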

Is there an efficient way to turn lists into sets with NumPy?

EDIT 1:
My input is large: 3,000 records, each with 30,000 ids: [[1,...,12,13,...,30000], [1,...,43,45,...,30000], ..., [...]]

rayryeng
Alireza

3 Answers


First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:

set(x.flatten())

Edit: since it seems you want one set per row rather than a single set over the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.
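
A quick sketch of both variants on the example array from the question:

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# One set over every element of the array
flat = set(x.flatten())
print(sorted(flat))  # [1.0, 2.0, 2.5, 12.0, 35.0]

# One set per row; duplicates within each row are removed
per_row = [set(v) for v in x]
print(per_row)
```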

P. Camilleri

The current state of your question (can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?

import numpy as np

rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]

Runtimes in an IPython shell:

>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Update: as @hpaulj pointed out in the comments, my dummy example is biased, since floating-point random numbers will almost certainly all be unique. So here's a more realistic example with integers:

>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))

>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.
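
For instance, on a tiny integer array (my own toy example, not part of the benchmark):

```python
import numpy as np

arr = np.array([[1, 1, 2],
                [3, 4, 5]])

out = [np.unique(subarr) for subarr in arr]
print([len(u) for u in out])  # [2, 3] -- the rows end up with different lengths
```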


A couple of earlier 'row-wise' unique questions:

vectorize numpy unique for subarrays

Numpy: Row Wise Unique elements

Count unique elements row wise in an ndarray

In a couple of these the count is more interesting than the actual unique values.

If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.
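
A minimal sketch of that row-wise iteration, here counting unique values per row (a plain list comprehension; the varying row lengths are exactly why no single 2d result is possible):

```python
import numpy as np

arr = np.array([[1, 1, 2, 3],
                [4, 4, 4, 4],
                [5, 6, 7, 8]])

# Each row can yield a different number of unique values,
# so we iterate row by row instead of vectorizing.
uniques = [np.unique(row) for row in arr]
counts = [len(u) for u in uniques]
print(counts)  # [3, 1, 4]
```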

Dadep
hpaulj
  • @fabrik, those are SO links, in a 3 yr old answer. By your logic we couldn't mark posts as duplicates without repeating the old answers. – hpaulj Sep 14 '18 at 16:05