I have a data frame as-
my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
'ssn':['1234','7847','0648','4567','0648']})
Here I would like to find out duplicates considering last_name and firs_name columns and if any duplicates found their respective ssn needs to be rolled up with semicolon(;) if SSN are not different. if SSN are also same only one SSN needs to be present.
the expected output as:
Here since mallesh yamulla is duplicated and has different SSN's they are rolled up with ';'
and in case of jagarini yegurla it has a unique SSN hence one SSN is only taken.
Added one more case:
Here on given any set of column it should rollup the unique values using ; from the remaining columns. here on last and first name, roll up should be done on both DOB and SSN.
my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
'ssn':['1234','7847','0648','4567','0648'],
'dob':['10/11/1990','09/16/1991','01/01/1990','10/11/1990','02/14/1983'] })
Another case as:
my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
'ssn':['1234','7847','0648','4567','0648'],
'dob':['10/11/1990','09/16/1991','01/01/1990','','02/14/1983'] })
In case of having null values in a field it should treat as empty not as a value.
";10/11/1990" it should just be "10/11/1990" for mallesh yamulla entry.