In general, the `is_duplicated` method will likely perform best. Let's take a look at some alternative ways to accomplish this. And we'll do some (very) non-rigorous benchmarking, just to see which ones perform reasonably well.
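(The two outputs below assume a small DataFrame like the one in your question. The exact values don't matter; a minimal, hypothetical stand-in with one duplicated row would be something like this:)
import polars as pl

# Hypothetical stand-in for the DataFrame in the question:
# three string columns, with the row ("1", "7", "9") appearing twice.
df = pl.DataFrame(
    {
        "column_0": ["1", "2", "1"],
        "column_1": ["7", "8", "7"],
        "column_2": ["9", "3", "9"],
    }
)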
Some alternatives
One alternative is a `filter` statement with an `over` (windowing) expression on all columns. One caution with windowed expressions: they are convenient, but can be costly performance-wise.
df.filter(pl.count("column_1").over(df.columns) > 1)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Another alternative is a `groupby`, followed by a `join`. Basically, we'll count the number of times each combination of columns occurs. I'm using a `semi` join here, simply because I don't want to include the `count` column in my final results. (An `inner` join variant is sketched after the output below.)
df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="semi",
)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
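To illustrate that design choice: if you did want to keep the `count` column in the output, the same approach works with an `inner` join instead of the `semi` join. This is just a sketch, using the same API as above:
df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="inner",
)
The result would contain the same duplicated rows, plus a `count` column showing how many times each combination occurs.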
Some (very) non-rigorous benchmarking
One way to see which alternatives perform reasonably well is to time the performance on a test dataset that might resemble the datasets that you will use. For lack of something better, I'll stick to something that looks close to the dataset in your question.
Set `nbr_rows` to something that will challenge your machine. (My machine is a 32-core system, so I'm going to choose a reasonably high number of rows.)
import string

import numpy as np
import polars as pl

nbr_rows = 100_000_000
df = pl.DataFrame(
    {
        "col1": np.random.choice(1_000, nbr_rows),
        "col2": np.random.choice(1_000, nbr_rows),
        "col3": np.random.choice(list(string.ascii_letters), nbr_rows),
        "col4": np.random.choice(1_000, nbr_rows),
    }
)
print(df)
shape: (100000000, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╡
│ 955 ┆ 186 ┆ j ┆ 851 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 530 ┆ 199 ┆ d ┆ 376 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 109 ┆ 609 ┆ G ┆ 115 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 886 ┆ 487 ┆ d ┆ 479 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 837 ┆ 406 ┆ Y ┆ 60 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 467 ┆ 769 ┆ P ┆ 344 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 548 ┆ 372 ┆ F ┆ 410 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 379 ┆ 578 ┆ t ┆ 287 │
└──────┴──────┴──────┴──────┘
Now let's benchmark some alternatives. Since these may or may not resemble your datasets (or your computing platform), I won't run the benchmarks multiple times. For our purposes, we're just trying to weed out alternatives that might perform very poorly.
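(If you want to re-run these timings yourself, a small helper can cut down on the boilerplate. This is just a sketch using Python's standard library; the `timer` helper and its `label` argument are my own naming, not part of the Polars API.)
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Wall-clock timer: prints the elapsed seconds when the block exits.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Example usage:
# with timer("is_duplicated"):
#     df[df.is_duplicated(), :]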
Alternative: `is_duplicated`
import time
start = time.perf_counter()
df[df.is_duplicated(),:]
end = time.perf_counter()
print(end - start)
>>> print(end - start)
7.834882180000932
Since the `is_duplicated` method is provided by the Polars API, we can be reasonably assured that it will perform very well. Indeed, this should be the standard against which we compare the other alternatives.
Alternative: `filter` using an `over` (windowing) expression
start = time.perf_counter()
df.filter(pl.count("col1").over(df.columns) > 1)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
18.136289041000055
As expected, the `over` (windowing) expression is rather costly: it computes the count for each group and then broadcasts that result back to every row.
Alternative: `groupby` followed by a `join`
start = time.perf_counter()
df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="semi",
)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
9.419006452999383
Somewhat better ... but not as good as using the `is_duplicated` method provided by the Polars API.
Alternative: `concat_str`
Let's also look at an alternative suggested in another answer. To be fair, @FBruzzesi did say "I am not sure this is optimal by any means" - but let's see how it performs.
start = time.perf_counter()
df.filter(pl.concat_str(df.columns, sep='|').is_duplicated())
end = time.perf_counter()
print(end - start)
>>> print(end - start)
37.238660977998734
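As an aside, a struct can serve the same row-wise purpose without materializing concatenated strings. This is only a sketch, and it assumes your Polars version supports `is_duplicated` on struct columns (older releases may not):
# Assumption: is_duplicated works on struct columns in your Polars version.
df.filter(pl.struct(df.columns).is_duplicated())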
Edit
Additional Alternative: `filter` and `is_duplicated`
We can also use `filter` with `is_duplicated`. Since `df.is_duplicated()` is not a column in the DataFrame when the filter is run, we'll need to wrap it in a `polars.lit` expression.
start = time.perf_counter()
df.filter(pl.lit(df.is_duplicated()))
end = time.perf_counter()
print(end - start)
>>> print(end - start)
8.115436136999051
This performs just as well as using `is_duplicated` and boolean indexing.
Did this help? If nothing else, this shows some different ways to use the Polars API.