How to get the difference sets of two Polars DataFrames

Question

With pandas, I can keep only the rows whose key columns appear in exactly one of two DataFrames:

pd.concat([df1, df2], axis=0).drop_duplicates(subset=['name1','name2'],keep=False)

How can I achieve the same result with Polars?

score 2 · Answer 1 · 2022-08-08T08:16:45.000

If we start with these two datasets:

import polars as pl
import pandas as pd

df1 = pl.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6],
        "col2": ["a", "b", "c", "d", "e", "f"],
        "col3": [100, 200, 300, 400, 500, 600],
    }
)
df1

df2 = pl.DataFrame(
    {
        "col1": [1, 3, 2, 4, 6],
        "col2": ["a", "c", "b", "z", "e"],
        "col3": [10, 30, 20, 40, 50],
    }
)
df2

>>> df1
shape: (6, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ a    ┆ 100  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ b    ┆ 200  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ c    ┆ 300  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4    ┆ d    ┆ 400  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5    ┆ e    ┆ 500  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 6    ┆ f    ┆ 600  │
└──────┴──────┴──────┘

>>> df2
shape: (5, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ a    ┆ 10   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ c    ┆ 30   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ b    ┆ 20   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4    ┆ z    ┆ 40   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 6    ┆ e    ┆ 50   │
└──────┴──────┴──────┘

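We can express the difference sets in Polars by concatenating the two DataFrames and keeping only the rows whose (col1, col2) combination occurs exactly once in the combined frame. The with_row_count() calls only add a row_nr column that mimics the pandas index, so they can be omitted. Recent Polars releases deprecate with_row_count() and pl.count() in favor of with_row_index() and pl.len().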

(
    pl.concat([df1.with_row_count(), df2.with_row_count()])
    .filter(pl.count().over(['col1', 'col2']) == 1)
)

shape: (5, 4)
┌────────┬──────┬──────┬──────┐
│ row_nr ┆ col1 ┆ col2 ┆ col3 │
│ ---    ┆ ---  ┆ ---  ┆ ---  │
│ u32    ┆ i64  ┆ str  ┆ i64  │
╞════════╪══════╪══════╪══════╡
│ 3      ┆ 4    ┆ d    ┆ 400  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4      ┆ 5    ┆ e    ┆ 500  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5      ┆ 6    ┆ f    ┆ 600  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3      ┆ 4    ┆ z    ┆ 40   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4      ┆ 6    ┆ e    ┆ 50   │
└────────┴──────┴──────┴──────┘
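To make the logic of the window expression easier to follow, here is a minimal plain-Python sketch of the same idea. It is not how Polars works internally. The names rows1, rows2, key_counts and difference are made up for this illustration, and the tuples simply repeat the data from df1 and df2 above.

```python
from collections import Counter

# The (col1, col2, col3) rows of df1 and df2 above, as plain tuples.
rows1 = [(1, "a", 100), (2, "b", 200), (3, "c", 300),
         (4, "d", 400), (5, "e", 500), (6, "f", 600)]
rows2 = [(1, "a", 10), (3, "c", 30), (2, "b", 20), (4, "z", 40), (6, "e", 50)]

# Corresponds to pl.concat([df1, df2]).
combined = rows1 + rows2

# Corresponds to pl.count().over(["col1", "col2"]):
# how many rows share each (col1, col2) key.
key_counts = Counter((col1, col2) for col1, col2, _ in combined)

# Corresponds to .filter(... == 1): keep rows whose key occurs exactly once.
difference = [row for row in combined if key_counts[(row[0], row[1])] == 1]

print(difference)
# [(4, 'd', 400), (5, 'e', 500), (6, 'f', 600), (4, 'z', 40), (6, 'e', 50)]
```

Because the count is taken over the combined data, a key that is duplicated inside a single DataFrame is dropped as well, which matches keep=False in pandas.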

For comparison, pandas returns the same rows for this data:

pd.concat([df1.to_pandas(), df2.to_pandas()], axis=0).drop_duplicates(
    subset=["col1", "col2"], keep=False
)

>>> pd.concat([df1.to_pandas(), df2.to_pandas()], axis=0).drop_duplicates(
...     subset=["col1", "col2"], keep=False
... )
   col1 col2  col3
3     4    d   400
4     5    e   500
5     6    f   600
3     4    z    40
4     6    e    50

I edited my answer to show only the solution using Expressions, as using Expressions is more elegant. — , Aug 08 '22 at 08:17
