How to find and mark duplicates in a python datatable

Question

I would like to identify the duplicated rows in a py-datatable by group (and create a helper column C with a bool).

It should work along the lines of this:

DT = dt.Frame(A=[1, 2, 1, 2, 2, 1], B=list("XXYYYY"))

I get TypeError: Expected a Frame, instead got class 'datatable.expr.expr.Expr' when I apply the grouping to find the unique observations per group.

However, unique() does not work, and the documentation on the available functions for py-datatable is pretty sparse: https://datatable.readthedocs.io/en/v0.10.1/using-datatable.html#perform-groupby-calculations

I'm not sure whether py-datatable is so far behind R's data.table that this is simply not possible. It seems like a basic operation, but I can't find the solution. Does someone have it, or can someone point me to resources, please? Ideally this would include the syntax for assigning the bool (duplicate or not) to a new column C in one line of code.

Kindly add your expected output – sammywemmy, Jun 15 '20 at 22:42

score 2 · Accepted Answer · answered Jun 16 '20 at 03:15

As far as I understand, the asker would like to create a column that indicates whether a particular observation is duplicated or not.

Here is my solution:

import datatable as dt
from datatable import by, f, count

Sample datatable:

DT_EX = dt.Frame(A=list("XXYYYYXX"), B=[1, 2, 1, 2, 2, 1, 3, 3])

Out[3]: 
   | A    B
-- + --  --
 0 | X    1
 1 | X    2
 2 | Y    1
 3 | Y    2
 4 | Y    2
 5 | Y    1
 6 | X    3
 7 | X    3

[8 rows x 2 columns]

Then run this code chunk:

DT_EX[:, count(), by(f.A, f.B)][:, f[:].extend({'duplicated': f.count > 1})]

It works like this: first it groups by columns A and B and counts the observations per group. Next it extends the result with a new column called duplicated, which is True (shown as 1) when the group's count is greater than 1 and False (shown as 0) otherwise.

Output:

Out[5]: 
   | A    B  count  duplicated
-- + --  --  -----  ----------
 0 | X    1      1           0
 1 | X    2      1           0
 2 | X    3      2           1
 3 | Y    1      2           1
 4 | Y    2      2           1

[5 rows x 4 columns]
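
As a cross-check of the logic described above, here is a minimal plain-Python sketch. It uses only the standard library, not the datatable API, and the names rows, counts, summary and row_flags are made up for this illustration. It counts each (A, B) pair, flags pairs seen more than once, and also derives one flag per original row, since the question asks to mark rows rather than groups.

```python
from collections import Counter

# Same sample data as DT_EX, held as plain (A, B) tuples.
rows = list(zip("XXYYYYXX", [1, 2, 1, 2, 2, 1, 3, 3]))

# Count how often each (A, B) combination occurs.
counts = Counter(rows)

# One entry per group, mirroring the grouped frame: (A, B, count, duplicated).
summary = [(a, b, n, n > 1) for (a, b), n in sorted(counts.items())]

# One flag per original row: True if that row's (A, B) pair occurs more than once.
row_flags = [counts[row] > 1 for row in rows]

for entry in summary:
    print(entry)
print(row_flags)
```

The summary entries should line up with the grouped output above, with True/False in place of 1/0.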

Thanks a lot for your answer. Upon studying your solution I also found an alternative which is more compact. `from datatable import update
DT[:, update(duplicated=(count() > 1)), by(f.A, f.B)]` – Zappageck, Jun 16 '20 at 09:21
