
I used:

df['ids'] = df['ids'].values.astype(set)

to turn each list into a set, but the output still contained plain numbers, not sets:

>>> x = np.array([[1, 2, 2.5],[12,35,12]])

>>> x.astype(set)
array([[1.0, 2.0, 2.5],
       [12.0, 35.0, 12.0]], dtype=object)
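
My reading of what happens (a small check, for context): `astype(set)` only changes the array's dtype to `object`; it never calls `set()` on anything, so each element is still a plain number:

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])
y = x.astype(set)

print(y.dtype)        # object
print(type(y[0, 0]))  # <class 'float'> -- the elements were never turned into sets
```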

Is there an efficient way to turn lists into sets with NumPy?

EDIT 1:
My input is large: 3,000 records, each with 30,000 ids: [[1,...,12,13,...,30000], [1,...,43,45,...,30000], ..., [...]]

rayryeng
Alireza

3 Answers


First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:

set(x.flatten())

Edit: since it seems you want one set per row rather than a single set over the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.
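
A quick sketch of both variants on the example array from the question:

```python
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# One set over every element of the array
flat = set(x.flatten())
print(sorted(flat))  # [1.0, 2.0, 2.5, 12.0, 35.0]

# One set per row; duplicates within each row are removed
per_row = [set(v) for v in x]
print(per_row)
```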

P. Camilleri

The current state of your question (can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?

import numpy as np

rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]

Runtimes in an IPython shell:

>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Update: as @hpaulj pointed out in the comments, my dummy example is biased, since floating-point random numbers will almost certainly all be unique. So here's a more realistic example with integers:

>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))

>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.
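
For instance, on a tiny integer array (my own toy example, not part of the benchmark):

```python
import numpy as np

arr = np.array([[1, 1, 2],
                [3, 4, 5]])

out = [np.unique(subarr) for subarr in arr]
print([len(u) for u in out])  # [2, 3] -- the rows end up with different lengths
```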


A couple of earlier 'row-wise' unique questions:

vectorize numpy unique for subarrays

Numpy: Row Wise Unique elements

Count unique elements row wise in an ndarray

In a couple of these the count is more interesting than the actual unique values.

If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.
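
A minimal sketch of that row-wise iteration, here counting unique values per row (a plain list comprehension; the varying row lengths are exactly why no single 2d result is possible):

```python
import numpy as np

arr = np.array([[1, 1, 2, 3],
                [4, 4, 4, 4],
                [5, 6, 7, 8]])

# Each row can yield a different number of unique values,
# so we iterate row by row instead of vectorizing.
uniques = [np.unique(row) for row in arr]
counts = [len(u) for u in uniques]
print(counts)  # [3, 1, 4]
```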

Dadep
hpaulj
  • @fabrik, those are SO links, in a 3 yr old answer. By your logic we couldn't mark posts as duplicates without repeating the old answers. – hpaulj Sep 14 '18 at 16:05