0

I have a data frame as-

my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648']})

Here I would like to find out duplicates considering last_name and firs_name columns and if any duplicates found their respective ssn needs to be rolled up with semicolon(;) if SSN are not different. if SSN are also same only one SSN needs to be present.

enter image description here

the expected output as:

enter image description here

Here since mallesh yamulla is duplicated and has different SSN's they are rolled up with ';'

and in case of jagarini yegurla it has a unique SSN hence one SSN is only taken.

enter image description here

Added one more case:

Here on given any set of column it should rollup the unique values using ; from the remaining columns. here on last and first name, roll up should be done on both DOB and SSN.

my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648'],
                  'dob':['10/11/1990','09/16/1991','01/01/1990','10/11/1990','02/14/1983']   })

enter image description here

Another case as:

my_dt = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648'],
                  'dob':['10/11/1990','09/16/1991','01/01/1990','','02/14/1983']   })

In case of having null values in a field it should treat as empty not as a value.

";10/11/1990" it should just be "10/11/1990" for mallesh yamulla entry.

enter image description here

myamulla_ciencia
  • 1,282
  • 1
  • 8
  • 30

3 Answers3

1

Use a group_by and unique to remove duplicates. From there, you can use arr.join on the resulting list.

(
    my_dt
    .groupby(['last_name', 'first_name'])
    .agg([
        pl.col('ssn').unique()
    ])
    .with_column(
        pl.col('ssn').arr.join(';')
    )
)
shape: (3, 3)
┌───────────┬────────────┬───────────┐
│ last_name ┆ first_name ┆ ssn       │
│ ---       ┆ ---        ┆ ---       │
│ str       ┆ str        ┆ str       │
╞═══════════╪════════════╪═══════════╡
│ mallesh   ┆ yamulla    ┆ 4567;1234 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik    ┆ vemulla    ┆ 7847      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jagarini  ┆ yegurla    ┆ 0648      │
└───────────┴────────────┴───────────┘

Edit: if you want to ensure that the rolled up list is sorted:

(
    my_dt
    .groupby(['last_name', 'first_name'])
    .agg([
        pl.col('ssn')
        .unique()
        .sort()
    ])
    .with_column(
        pl.col('ssn')
        .arr.join(';')
    )
)
shape: (3, 3)
┌───────────┬────────────┬───────────┐
│ last_name ┆ first_name ┆ ssn       │
│ ---       ┆ ---        ┆ ---       │
│ str       ┆ str        ┆ str       │
╞═══════════╪════════════╪═══════════╡
│ jagarini  ┆ yegurla    ┆ 0648      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh   ┆ yamulla    ┆ 1234;4567 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik    ┆ vemulla    ┆ 7847      │
└───────────┴────────────┴───────────┘

Edit: Rolling up multiple columns

We can roll up multiple columns elegantly as follows:

(
    my_dt
    .groupby(["last_name", "first_name"])
    .agg([
        pl.all().unique().sort().cast(pl.Utf8)
    ])
    .with_columns([
        pl.exclude(['last_name', 'first_name']).arr.join(";")
    ])
)
shape: (3, 4)
┌───────────┬────────────┬───────────┬───────────────────────┐
│ last_name ┆ first_name ┆ ssn       ┆ dob                   │
│ ---       ┆ ---        ┆ ---       ┆ ---                   │
│ str       ┆ str        ┆ str       ┆ str                   │
╞═══════════╪════════════╪═══════════╪═══════════════════════╡
│ bhavik    ┆ vemulla    ┆ 7847      ┆ 1991-09-16            │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jagarini  ┆ yegurla    ┆ 0648      ┆ 1983-02-14;1990-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh   ┆ yamulla    ┆ 1234;4567 ┆ 1990-10-11            │
└───────────┴────────────┴───────────┴───────────────────────┘

Edit: eliminating empty strings and null values from rollup

We can add a filter step just before the arr.join to filter out both null and empty string "" values.

(
    my_dt.groupby(["last_name", "first_name"])
    .agg([pl.all().unique().sort().cast(pl.Utf8)])
    .with_columns(
        [
            pl.exclude(["last_name", "first_name"])
            .arr.eval(
                pl.element().filter(pl.element().is_not_null() & (pl.element() != ""))
            )
            .arr.join(";")
        ]
    )
)
  • OK. what if I have to roll up more than one column. here we are just working on SSN. On given a set of column roll up should be done on the remaining columns. Example: On last and first name, it should roll up dob, ssn, address, etc etc fields. – myamulla_ciencia Aug 24 '22 at 13:40
  • I have added a screenshot for the above case. – myamulla_ciencia Aug 24 '22 at 13:41
  • One small question? : after rolling up I can't see the semicolon(;) if the column volumes are more than one: Example: SSN- ['1234',' 4567'] here it should be displayed as '1234;4567' with a string type and not a list type. – myamulla_ciencia Aug 25 '22 at 02:45
  • Good catch. Let me revise the answer. –  Aug 25 '22 at 04:01
  • I've also added an extra `cast` expression, to make sure that `Date`/`Datetime` fields work correctly. –  Aug 25 '22 at 04:09
  • That's great. It works as expected – myamulla_ciencia Aug 25 '22 at 04:28
  • OK. how to ignore null while rolling up: example - after rolling on last and first name, DOB comes as this since there is a NULL in one row ";10/11/1990". but here i should only get "10/11/1990". technically NULL should not be treated on rolling columns. how to filter them in above code snippet. added the same case to the question above. – myamulla_ciencia Aug 25 '22 at 04:44
  • On my real dataset it is showing as : "null;01/06/1940". it should just be "01/06/1940" – myamulla_ciencia Aug 25 '22 at 04:47
  • I’ve added a `filter` step to eliminate both `null` values and empty strings `””`. –  Aug 25 '22 at 05:38
  • 1
    Fabulous, all of my use cases are covered with this. thanks a ton. – myamulla_ciencia Aug 25 '22 at 05:50
0

I would suggest using lists to store the values:

new_df = my_dt.groupby(['last_name', 'first_name']).agg(list).reset_index()

Output:

  last_name first_name           ssn                       dob
0    bhavik    vemulla        [7847]              [09/16/1991]
1  jagarini    yegurla  [0648, 0648]  [01/01/1990, 02/14/1983]
2   mallesh    yamulla  [1234, 4567]  [10/11/1990, 10/11/1990]

You can convert to concatenated values with:

for col in new_df.set_index(['last_name', 'first_name']).columns:
    new_df[col] = new_df[col].apply(';'.join)

Output:

  last_name first_name        ssn                    dob
0    bhavik    vemulla       7847             09/16/1991
1  jagarini    yegurla  0648;0648  01/01/1990;02/14/1983
2   mallesh    yamulla  1234;4567  10/11/1990;10/11/1990
René
  • 4,594
  • 5
  • 23
  • 52
0

You can start by using group_by and agg functions to get the results:

my_dt.unique().groupby(['first_name', 'last_name']).agg(pl.col('ssn').apply(lambda x: ';'.join(x)))
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Aug 29 '22 at 16:40