In general, the `is_duplicated` method will likely perform best. Let's take a look at some alternative ways to accomplish this. And we'll do some (very) non-rigorous benchmarking, just to see which ones perform reasonably well.
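(The two outputs below assume a small DataFrame like the one in your question. The exact values don't matter; a minimal, hypothetical stand-in with one duplicated row would be something like this:)
import polars as pl

# Hypothetical stand-in for the DataFrame in the question:
# three string columns, with the row ("1", "7", "9") appearing twice.
df = pl.DataFrame(
    {
        "column_0": ["1", "2", "1"],
        "column_1": ["7", "8", "7"],
        "column_2": ["9", "3", "9"],
    }
)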
Some alternatives
One alternative is a `filter` statement with an `over` (windowing) expression on all columns. One caution with windowed expressions: they are convenient, but can be costly performance-wise.
df.filter(pl.count("column_1").over(df.columns) > 1)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Another alternative is a `groupby`, followed by a `join`. Basically, we'll count the number of times each combination of columns occurs. I'm using a `semi` join here, simply because I don't want to include the `count` column in my final results. (An `inner` join variant is sketched after the output below.)
df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="semi",
)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
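To illustrate that design choice: if you did want to keep the `count` column in the output, the same approach works with an `inner` join instead of the `semi` join. This is just a sketch, using the same API as above:
df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="inner",
)
The result would contain the same duplicated rows, plus a `count` column showing how many times each combination occurs.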
Some (very) non-rigorous benchmarking
One way to see which alternatives perform reasonably well is to time the performance on a test dataset that might resemble the datasets that you will use. For lack of something better, I'll stick to something that looks close to the dataset in your question.
Set `nbr_rows` to something that will challenge your machine. (My machine is a 32-core system, so I'm going to choose a reasonably high number of rows.)
import string

import numpy as np
import polars as pl

nbr_rows = 100_000_000
df = pl.DataFrame(
    {
        "col1": np.random.choice(1_000, nbr_rows),
        "col2": np.random.choice(1_000, nbr_rows),
        "col3": np.random.choice(list(string.ascii_letters), nbr_rows),
        "col4": np.random.choice(1_000, nbr_rows),
    }
)
print(df)
shape: (100000000, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╡
│ 955 ┆ 186 ┆ j ┆ 851 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 530 ┆ 199 ┆ d ┆ 376 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 109 ┆ 609 ┆ G ┆ 115 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 886 ┆ 487 ┆ d ┆ 479 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 837 ┆ 406 ┆ Y ┆ 60 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 467 ┆ 769 ┆ P ┆ 344 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 548 ┆ 372 ┆ F ┆ 410 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 379 ┆ 578 ┆ t ┆ 287 │
└──────┴──────┴──────┴──────┘
Now let's benchmark some alternatives. Since these may or may not resemble your datasets (or your computing platform), I won't run the benchmarks multiple times. For our purposes, we're just trying to weed out alternatives that might perform very poorly.
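(If you want to re-run these timings yourself, a small helper can cut down on the boilerplate. This is just a sketch using Python's standard library; the `timer` helper and its `label` argument are my own naming, not part of the Polars API.)
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Wall-clock timer: prints the elapsed seconds when the block exits.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Example usage:
# with timer("is_duplicated"):
#     df[df.is_duplicated(), :]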
Alternative: `is_duplicated`
import time
start = time.perf_counter()
df[df.is_duplicated(),:]
end = time.perf_counter()
print(end - start)
>>> print(end - start)
7.834882180000932
Since the `is_duplicated` method is provided by the Polars API, we can be reasonably assured that it will perform very well. Indeed, this should be the standard against which we compare the other alternatives.
Alternative: `filter` using an `over` (windowing) expression
start = time.perf_counter()
df.filter(pl.count("col1").over(df.columns) > 1)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
18.136289041000055
As expected, the `over` (windowing) expression is rather costly: it computes the count for each group and then broadcasts that result back to every row.
Alternative: `groupby` followed by a `join`
start = time.perf_counter()
df.join(
    df=df.groupby(df.columns)
    .agg(pl.count().alias("count"))
    .filter(pl.col("count") > 1),
    on=df.columns,
    how="semi",
)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
9.419006452999383
Somewhat better ... but not as good as using the `is_duplicated` method provided by the Polars API.
Alternative: `concat_str`
Let's also look at an alternative suggested in another answer. To be fair, @FBruzzesi did say "I am not sure this is optimal by any means" - but let's see how it performs.
start = time.perf_counter()
df.filter(pl.concat_str(df.columns, sep='|').is_duplicated())
end = time.perf_counter()
print(end - start)
>>> print(end - start)
37.238660977998734
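As an aside, a struct can serve the same row-wise purpose without materializing concatenated strings. This is only a sketch, and it assumes your Polars version supports `is_duplicated` on struct columns (older releases may not):
# Assumption: is_duplicated works on struct columns in your Polars version.
df.filter(pl.struct(df.columns).is_duplicated())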
Edit
Additional Alternative: `filter` and `is_duplicated`
We can also use `filter` with `is_duplicated`. Since `df.is_duplicated()` is not a column in the DataFrame when the filter is run, we'll need to wrap it in a `polars.lit` expression.
start = time.perf_counter()
df.filter(pl.lit(df.is_duplicated()))
end = time.perf_counter()
print(end - start)
>>> print(end - start)
8.115436136999051
This performs just as well as using `is_duplicated` and boolean indexing.
Did this help? If nothing else, this shows some different ways to use the Polars API.