How to drop duplicates in a python datatable h2oai

Question

The datatable package in Python (https://github.com/h2oai/datatable/) can count the number of unique values in a column. Is there a way to drop duplicate values with this package, or do I have to use the slower pandas package?

score 7 · Accepted Answer · answered Dec 30 '19 at 23:43

If you want to find the unique values in a single column, you can use the function dt.unique(), which takes a column and returns a new column containing all the unique values from the original:

>>> import datatable as dt
>>> DT = dt.Frame(A=[1, 3, 2, 1, 4, 2, 1], B=list("ABCDEFG"))
>>> dt.unique(DT["A"])
   |  A
-- + --
 0 |  1
 1 |  2
 2 |  3
 3 |  4

[4 rows x 1 column]

If, on the other hand, you have a multi-column Frame and want to keep only one row for each unique value in one of the columns, then this is equivalent to grouping by that column, and can be approached as such:

>>> from datatable import f, by, first
>>> DT[:, first(f[1:]), by(f[0])]
   |  A  B 
-- + --  --
 0 |  1  A 
 1 |  2  C 
 2 |  3  B 
 3 |  4  E 

[4 rows x 2 columns]
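To make the semantics of "keep one row per key" concrete, here is a minimal plain-Python sketch that does not use datatable at all. The helper name `drop_duplicates` and its `key`/`keep` parameters are hypothetical, chosen only for illustration. It assumes rows are tuples and returns results in sorted key order, mirroring the group order shown in the output above.

```python
def drop_duplicates(rows, key=None, keep="first"):
    """Keep one row per distinct key, returned in sorted key order.

    rows: list of tuples (one tuple per row)
    key:  column indices that form the key; None means all columns
    keep: "first" or "last" row of each group
    NOTE: hypothetical helper for illustration, not part of datatable.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    seen = {}
    for row in rows:
        k = tuple(row) if key is None else tuple(row[i] for i in key)
        # For "first", only the first row seen for a key is stored;
        # for "last", later rows overwrite earlier ones.
        if keep == "last" or k not in seen:
            seen[k] = row
    return [seen[k] for k in sorted(seen)]


# Same data as the Frame above: column A, then column B.
rows = list(zip([1, 3, 2, 1, 4, 2, 1], "ABCDEFG"))

print(drop_duplicates(rows, key=[0]))
# [(1, 'A'), (2, 'C'), (3, 'B'), (4, 'E')]
```

Passing `key=None` treats every column as part of the key, which is the "group by all columns" case; `keep="last"` keeps the final row of each group instead of the first.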

To keep only the first (or last) row, what's the difference between the approach above, `DT[:, first(f[1:]), by(f[0])]`, and the other one found in the docs, `DT[1, :, by('A')]` (https://datatable.readthedocs.io/en/latest/manual/groupby_examples.html?highlight=group%20by%20duplicate)? — topchef, Dec 30 '21 at 01:30
@pasha, is there a smart way to group by all existing columns, so that it would be equivalent to drop_duplicates in pandas? I can use f[0], f[1], ..., but that doesn't look smart enough. — Areza, Apr 06 '22 at 13:35
