Difference between 2 Polars dataframes

Question

What is the best way to find the differences between 2 Polars dataframes? The frame_equal method tells me if there is a difference, I want to find where is the difference.

Example:

import polars as pl

df1 = pl.DataFrame([
    {'id': 1,'col1': ['a',None],'col2': ['x']},
    {'id': 2,'col1': ['b'],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘


df2 = pl.DataFrame([
    {'id': 1,'col1': ['a'],'col2': ['x']},
    {'id': 2,'col1': ['b', None],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘

The difference in the example is for id = 1 and id = 2.

I can join the dataframes:

df1.join(df2, on='id', suffix='_df2')
┌─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        ┆ col1_df2    ┆ col2_df2    │
│ --- ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┴─────────────┴─────────────┘

Expected result

I would like to either:

add a boolean columns that shows True in the rows with a difference
filter and only display rows with a difference.

The example has only 2 columns, but there are more columns in the dataframe.

Would this help? https://stackoverflow.com/questions/73273844/how-to-get-the-difference-sets-of-two-polars-dataframes — ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ, Feb 16 '23 at 19:09
hi @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ, I had checked that answer but it was for a different problem. thank you though — Luca, Feb 16 '23 at 19:18

score 1 · Accepted Answer · answered Feb 16 '23 at 18:48

1

Here's the filter approach

df1.join(df2, on='id', suffix='_df2') \
    .filter(pl.any([pl.col(x)!=pl.col(f"{x}_df2") for x in df1.columns if x!='id']))

If you wanted the bool column then you just change the filter to with_columns and add an alias.

df1.join(df2, on='id', suffix='_df2') \
    .with_columns(pl.any([pl.col(x)!=pl.col(f"{x}_df2") for x in df1.columns if x!='id']).alias("has_diff"))

This assumes that each df has all the same columns other than 'id'.

answered Feb 16 '23 at 18:48

Dean MacGregor

11,847
9
34
72

that's exactly it. thanks @Dean MacGregor. would a pl.struct concatenating all columns also be a good approach? – Luca Feb 16 '23 at 19:17
1

I guess it depends how you're determining what a good approach would be. I don't think it would make it faster. – Dean MacGregor Feb 16 '23 at 19:39

score 1 · Answer 2 · answered Feb 17 '23 at 09:10

1

You can also use == to compare frames:

>>> df1 == df2
shape: (3, 3)
┌──────┬───────┬──────┐
│ id   | col1  | col2 │
│ ---  | ---   | ---  │
│ bool | bool  | bool │
╞══════╪═══════╪══════╡
│ true | false | true │
│ true | false | true │
│ true | true  | true │
└──────┴───────┴──────┘

answered Feb 17 '23 at 09:10

jqurious

9,953
1
4
14

That’s a good one. I didn’t know it was possible. I imagine I should run a sort before to ensure the ordering is the same . Then do a pl.any(~(df1 == df2)) to find the rows with at least one difference – Luca Feb 17 '23 at 11:00

Difference between 2 Polars dataframes

2 Answers2