1

What is the best way to find the differences between 2 Polars dataframes? The frame_equal method tells me if there is a difference, I want to find where is the difference.

Example:

import polars as pl

df1 = pl.DataFrame([
    {'id': 1,'col1': ['a',None],'col2': ['x']},
    {'id': 2,'col1': ['b'],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘


df2 = pl.DataFrame([
    {'id': 1,'col1': ['a'],'col2': ['x']},
    {'id': 2,'col1': ['b', None],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘

The difference in the example is for id = 1 and id = 2.

I can join the dataframes:

df1.join(df2, on='id', suffix='_df2')
┌─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        ┆ col1_df2    ┆ col2_df2    │
│ --- ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┴─────────────┴─────────────┘

Expected result

I would like to either:

  • add a boolean columns that shows True in the rows with a difference
  • filter and only display rows with a difference.

The example has only 2 columns, but there are more columns in the dataframe.

Luca
  • 1,216
  • 6
  • 10

2 Answers2

1

Here's the filter approach

df1.join(df2, on='id', suffix='_df2') \
    .filter(pl.any([pl.col(x)!=pl.col(f"{x}_df2") for x in df1.columns if x!='id']))

If you wanted the bool column then you just change the filter to with_columns and add an alias.

df1.join(df2, on='id', suffix='_df2') \
    .with_columns(pl.any([pl.col(x)!=pl.col(f"{x}_df2") for x in df1.columns if x!='id']).alias("has_diff"))

This assumes that each df has all the same columns other than 'id'.

Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • that's exactly it. thanks @Dean MacGregor. would a pl.struct concatenating all columns also be a good approach? – Luca Feb 16 '23 at 19:17
  • 1
    I guess it depends how you're determining what a good approach would be. I don't think it would make it faster. – Dean MacGregor Feb 16 '23 at 19:39
1

You can also use == to compare frames:

>>> df1 == df2
shape: (3, 3)
┌──────┬───────┬──────┐
│ id   | col1  | col2 │
│ ---  | ---   | ---  │
│ bool | bool  | bool │
╞══════╪═══════╪══════╡
│ true | false | true │
│ true | false | true │
│ true | true  | true │
└──────┴───────┴──────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • That’s a good one. I didn’t know it was possible. I imagine I should run a sort before to ensure the ordering is the same . Then do a pl.any(~(df1 == df2)) to find the rows with at least one difference – Luca Feb 17 '23 at 11:00