Using a function defined as a Python context manager, I want to modify a Polars dataframe by reassignment inside the `with` block. I then want the context manager to print the row counts from before and after the change.
I tried the following:
import logging
from contextlib import contextmanager
from typing import Iterator

import polars as pl

logger = logging.getLogger(__name__)

def count_rows(df: pl.DataFrame) -> int:
    """ Counts the number of rows in a polars dataframe. """
    return df.height

# define the function I want to work
@contextmanager
def log_row_count_change(df: pl.DataFrame, action_desc: str = '', df_name: str = 'df') -> Iterator[None]:
    """ An easy way to log how many rows were added or removed from a dataframe during filters, joins, etc. """
    try:
        row_count_before = count_rows(df)
        logger.debug(f"Before '{action_desc}' action on '{df_name}', row count: {row_count_before:,}")
        yield
    finally:
        row_count_after = count_rows(df)
        row_count_change = row_count_after - row_count_before
        row_count_change_pct = row_count_change / row_count_before * 100
        print(f"During '{action_desc}' action on '{df_name}', row count changed by {row_count_change:,} rows ({row_count_before:,} -> {row_count_after:,}) ({row_count_change_pct:.2f}%).")
# define a dataframe for testing
df = pl.DataFrame({"a": [1, 1, 2], "b": [2, 2, 3], "c": [1, 2, 3]})

# call the main part
with log_row_count_change(df, 'drop duplicates on column a', 'df'):
    df = df.unique(subset=['a'])
When I run the above, it reports a row count of 3 both before and after. I want it to report a row count of 3 before and 2 after.