How to create a polars column listing duplicates of another column

Question

I have hard a hard time searching for the answer to this as I find it hard to put into words.

I'm trying to aggregate multiple listings of files on disks, some of which have the same files. I want only one row for a given file, and a separate column with the disks that file may be on.

Say I have the following DataFrame:

recordings = pl.DataFrame(
    {
        "disk": ["NT23", "NT24", "NT23", "NT24"],
        "path_on_disk": ["file_a.txt", "file_a.txt", "file_b.txt", "file_b.txt"],
        "other_data": [2.0, 2.0, 3.0, 3.0],
    }
)

Which looks something like this (SO doesn't like terminal characters I guess):

┌──────┬──────────────┬────────────┐
│ disk ┆ path_on_disk ┆ other_data │
│ ---  ┆ ---          ┆ ---        │
│ str  ┆ str          ┆ f64        │
╞══════╪══════════════╪════════════╡
│ NT23 ┆ file_a.txt   ┆ 2.0        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_a.txt   ┆ 2.0        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT23 ┆ file_b.txt   ┆ 3.0        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_b.txt   ┆ 3.0        │
└──────┴──────────────┴────────────┘

Actual df has > 500k rows
Actual df has more columns, but the if path_on_disk matches in two or more rows then we will assume that the rest of the fields in those rows match, except for disk.

I want to:

Find all rows where path_on_disk are the same
Make a new column disks containing the different values of disk joined together with ", ".join()

Something like this:

┌──────────────┬────────────┬────────────┐
│ path_on_disk ┆ disks      ┆ other_data │
│ ---          ┆ ---        ┆ ---        │
│ str          ┆ str        ┆ f64        │
╞══════════════╪════════════╪════════════╡
│ file_a       ┆ NT23, NT24 ┆ 2.0        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ file_b       ┆ NT23, NT24 ┆ 3.0        │
└──────────────┴────────────┴────────────┘

I've figured out that I can use recordings.groupby(["path_on_disk"]) to accomplish the first objective:

for group_df in recordings.groupby(["path_on_disk"]):
    if len(group_df) > 1:
        print(group_df)
        break

This shows the first found group of rows where path_on_disk match.

I tried the following, but got an error:

def merge_disknames(df: pl.DataFrame): 
    return ", ".join(sorted(df["disk"]))

recordings.groupby(["path_on_disk"]).apply(merge_disknames).rename("disks")

PanicException: Could net get DataFrame attribute '_df'. Make sure that you return a DataFrame object.: PyErr { type: <class 'AttributeError'>, value: AttributeError("'str' object has no attribute '_df'"), traceback: None }

I think this is a similar question https://stackoverflow.com/questions/75678741/polars-aggregate-arraystring-column-into-setstr — Mohamed Yasser, Jun 23 '23 at 23:17

score 1 · Accepted Answer · answered Jun 23 '23 at 23:19

The most polars-esque way to do this would be to, as you say, do a groupby. From there though, common operation after a groupby is to do an aggregation. Here the unique instances of the disk column by group gives us a list column, which we can then use the list.join method to finish the operation off.

All together:

recordings.groupby("path_on_disk")
    .agg(disks=pl.col("disk").unique())
    .with_columns(pl.col("disks").list.join(", ")
)

shape: (2, 2)
┌──────────────┬────────────┐
│ path_on_disk ┆ disks      │
│ ---          ┆ ---        │
│ str          ┆ str        │
╞══════════════╪════════════╡
│ file_a.txt   ┆ NT23, NT24 │
│ file_b.txt   ┆ NT23, NT24 │
└──────────────┴────────────┘

(Note that the order of disks is essentially random the way groupby works, you can do a sort / maintain_order if desired.)

There's also [`.unique().str.concat(", ")`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.concat.html#polars-expr-str-concat) which may be of interest. — jqurious, Jun 24 '23 at 00:35
Accepting as solved as this did mostly what I wanted. I had to add the other fields (`other_data`) to the `groupby` to get them to be included in the result. Thank you for the help--I need to better learn how `agg` works as it is a bit opaque to me. — RandyP, Jun 24 '23 at 21:14

How to create a polars column listing duplicates of another column

1 Answers1