What is the best way to find the differences between 2 Polars dataframes?
The frame_equal
method tells me if there is a difference, I want to find where is the difference.
Example:
import polars as pl
df1 = pl.DataFrame([
{'id': 1,'col1': ['a',None],'col2': ['x']},
{'id': 2,'col1': ['b'],'col2': ['y', None]},
{'id': 3,'col1': [None],'col2': ['z']}]
)
┌─────┬─────────────┬─────────────┐
│ id ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] │
╞═════╪═════════════╪═════════════╡
│ 1 ┆ ["a", null] ┆ ["x"] │
│ 2 ┆ ["b"] ┆ ["y", null] │
│ 3 ┆ [null] ┆ ["z"] │
└─────┴─────────────┴─────────────┘
df2 = pl.DataFrame([
{'id': 1,'col1': ['a'],'col2': ['x']},
{'id': 2,'col1': ['b', None],'col2': ['y', None]},
{'id': 3,'col1': [None],'col2': ['z']}]
)
┌─────┬─────────────┬─────────────┐
│ id ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] │
╞═════╪═════════════╪═════════════╡
│ 1 ┆ ["a"] ┆ ["x"] │
│ 2 ┆ ["b", null] ┆ ["y", null] │
│ 3 ┆ [null] ┆ ["z"] │
└─────┴─────────────┴─────────────┘
The difference in the example is for id = 1 and id = 2.
I can join the dataframes:
df1.join(df2, on='id', suffix='_df2')
┌─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id ┆ col1 ┆ col2 ┆ col1_df2 ┆ col2_df2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] ┆ list[str] ┆ list[str] │
╞═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1 ┆ ["a", null] ┆ ["x"] ┆ ["a"] ┆ ["x"] │
│ 2 ┆ ["b"] ┆ ["y", null] ┆ ["b", null] ┆ ["y", null] │
│ 3 ┆ [null] ┆ ["z"] ┆ [null] ┆ ["z"] │
└─────┴─────────────┴─────────────┴─────────────┴─────────────┘
Expected result
I would like to either:
- add a boolean columns that shows True in the rows with a difference
- filter and only display rows with a difference.
The example has only 2 columns, but there are more columns in the dataframe.