- with pandas
pd.concat([df1, df2], axis=0).drop_duplicates(subset=['name1','name2'],keep=False)
- How can I get the same result with polars?
If we start with these two datasets:
import polars as pl
import pandas as pd
df1 = pl.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6],
        "col2": ["a", "b", "c", "d", "e", "f"],
        "col3": [100, 200, 300, 400, 500, 600],
    }
)
df1
df2 = pl.DataFrame(
    {
        "col1": [1, 3, 2, 4, 6],
        "col2": ["a", "c", "b", "z", "e"],
        "col3": [10, 30, 20, 40, 50],
    }
)
df2
>>> df1
shape: (6, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════╡
│ 1 ┆ a ┆ 100 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ b ┆ 200 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ c ┆ 300 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ d ┆ 400 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5 ┆ e ┆ 500 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 6 ┆ f ┆ 600 │
└──────┴──────┴──────┘
>>> df2
shape: (5, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════╡
│ 1 ┆ a ┆ 10 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ c ┆ 30 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ b ┆ 20 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ z ┆ 40 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 6 ┆ e ┆ 50 │
└──────┴──────┴──────┘
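We can express the same thing in Polars by concatenating both frames and keeping only the rows whose (col1, col2) pair occurs exactly once. The with_row_count call is there only to mimic the pandas index. (Recent Polars releases deprecate with_row_count and pl.count() in favor of with_row_index and pl.len().)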
(
    pl.concat([df1.with_row_count(), df2.with_row_count()])
    .filter(pl.count().over(['col1', 'col2']) == 1)
)
shape: (5, 4)
┌────────┬──────┬──────┬──────┐
│ row_nr ┆ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ i64 │
╞════════╪══════╪══════╪══════╡
│ 3 ┆ 4 ┆ d ┆ 400 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 5 ┆ e ┆ 500 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5 ┆ 6 ┆ f ┆ 600 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ z ┆ 40 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 6 ┆ e ┆ 50 │
└────────┴──────┴──────┴──────┘
For comparison, pandas gives the same rows:
pd.concat([df1.to_pandas(), df2.to_pandas()], axis=0).drop_duplicates(
subset=["col1", "col2"], keep=False
)
>>> pd.concat([df1.to_pandas(), df2.to_pandas()], axis=0).drop_duplicates(
...     subset=["col1", "col2"], keep=False
... )
col1 col2 col3
3 4 d 400
4 5 e 500
5 6 f 600
3 4 z 40
4 6 e 50