How to roll up duplicate observation in pydatatable?

Question

I have a data frame as-

my_dt = dt.Frame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648']})

Here I would like to find out duplicates considering last_name and firs_name columns and if any duplicates found their respective ssn needs to be rolled up with semicolon(;) if SSN are not different. if SSN are also same only one SSN needs to be present.

the expected output as:

Here since mallesh yamulla is duplicated and has different SSN's they are rolled up with ';'

and in case of jagarini yegurla it has a unique SSN hence one SSN is only taken.

at the moment, string operations in datatable are very limited. You can drop duplicates with a cumulative count, however, for the string operations, you are better served with Pandas — sammywemmy, Aug 17 '22 at 08:05
yea, polars is another exciting library. Hopefully, soon pydatatable will be more robust in terms of features — sammywemmy, Aug 17 '22 at 09:26

score 3 · Answer 1 · answered Aug 17 '22 at 05:07

3

Use DataFrame.drop_duplicates by all columns first and then aggregate join:

df = (my_dt.drop_duplicates()
           .groupby(['last_name','first_name'], sort=False)['ssn']
           .agg(';'.join)
           .reset_index())
print (df)
  last_name first_name        ssn
0   mallesh    yamulla  1234;4567
1    bhavik    vemulla       7847
2  jagarini    yegurla       0648

answered Aug 17 '22 at 05:07

jezrael

822,522
95
1,334
1,252

In Pandas I can get it done. as I'm dealing with higher volumes of data I'm working with pydatatable. – myamulla_ciencia Aug 17 '22 at 05:14
1

@myamulla_ciencia - I only removed duplicates by `my_dt[:, first(f[0:]), by([f[0],f[1],f[2]], add_columns=False)]`, but string operation join are not implemented in `datatable`, so cannot do second step. Only way I think is convert to pandas and aggregate join. – jezrael Aug 17 '22 at 05:55
@myamulla_ciencia - I try `sum` by `my_dt[:, [dt.sum(f[2])], by([f[0],f[1]])]`, but it raise error `TypeError: Unable to apply reduce function `sum()` to a column of type `str32`` – jezrael Aug 17 '22 at 05:56

How to roll up duplicate observation in pydatatable?

1 Answers1