I have hard a hard time searching for the answer to this as I find it hard to put into words.
I'm trying to aggregate multiple listings of files on disks, some of which have the same files. I want only one row for a given file, and a separate column with the disks that file may be on.
Say I have the following DataFrame:
recordings = pl.DataFrame(
{
"disk": ["NT23", "NT24", "NT23", "NT24"],
"path_on_disk": ["file_a.txt", "file_a.txt", "file_b.txt", "file_b.txt"],
"other_data": [2.0, 2.0, 3.0, 3.0],
}
)
Which looks something like this (SO doesn't like terminal characters I guess):
┌──────┬──────────────┬────────────┐
│ disk ┆ path_on_disk ┆ other_data │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════╪══════════════╪════════════╡
│ NT23 ┆ file_a.txt ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_a.txt ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT23 ┆ file_b.txt ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_b.txt ┆ 3.0 │
└──────┴──────────────┴────────────┘
- Actual df has > 500k rows
- Actual df has more columns, but the if
path_on_disk
matches in two or more rows then we will assume that the rest of the fields in those rows match, except fordisk
.
I want to:
- Find all rows where
path_on_disk
are the same - Make a new column
disks
containing the different values ofdisk
joined together with", ".join()
Something like this:
┌──────────────┬────────────┬────────────┐
│ path_on_disk ┆ disks ┆ other_data │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════╪════════════╪════════════╡
│ file_a ┆ NT23, NT24 ┆ 2.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ file_b ┆ NT23, NT24 ┆ 3.0 │
└──────────────┴────────────┴────────────┘
I've figured out that I can use recordings.groupby(["path_on_disk"])
to accomplish the first objective:
for group_df in recordings.groupby(["path_on_disk"]):
if len(group_df) > 1:
print(group_df)
break
This shows the first found group of rows where path_on_disk
match.
I tried the following, but got an error:
def merge_disknames(df: pl.DataFrame):
return ", ".join(sorted(df["disk"]))
recordings.groupby(["path_on_disk"]).apply(merge_disknames).rename("disks")
PanicException: Could net get DataFrame attribute '_df'. Make sure that you return a DataFrame object.: PyErr { type: <class 'AttributeError'>, value: AttributeError("'str' object has no attribute '_df'"), traceback: None }