2

I have a data frame as-

my_dt = dt.Frame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648']})

enter image description here

Here I would like to find out duplicates considering last_name and firs_name columns and if any duplicates found their respective ssn needs to be rolled up with semicolon(;) if SSN are not different. if SSN are also same only one SSN needs to be present.

the expected output as:

enter image description here

Here since mallesh yamulla is duplicated and has different SSN's they are rolled up with ';'

and in case of jagarini yegurla it has a unique SSN hence one SSN is only taken.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
myamulla_ciencia
  • 1,282
  • 1
  • 8
  • 30

1 Answers1

3

Use DataFrame.drop_duplicates by all columns first and then aggregate join:

df = (my_dt.drop_duplicates()
           .groupby(['last_name','first_name'], sort=False)['ssn']
           .agg(';'.join)
           .reset_index())
print (df)
  last_name first_name        ssn
0   mallesh    yamulla  1234;4567
1    bhavik    vemulla       7847
2  jagarini    yegurla       0648
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • In Pandas I can get it done. as I'm dealing with higher volumes of data I'm working with pydatatable. – myamulla_ciencia Aug 17 '22 at 05:14
  • 1
    @myamulla_ciencia - I only removed duplicates by `my_dt[:, first(f[0:]), by([f[0],f[1],f[2]], add_columns=False)]`, but string operation join are not implemented in `datatable`, so cannot do second step. Only way I think is convert to pandas and aggregate join. – jezrael Aug 17 '22 at 05:55
  • @myamulla_ciencia - I try `sum` by `my_dt[:, [dt.sum(f[2])], by([f[0],f[1]])]`, but it raise error `TypeError: Unable to apply reduce function `sum()` to a column of type `str32`` – jezrael Aug 17 '22 at 05:56