I have a dataset that fits into RAM but causes an out-of-memory error when I run certain methods, such as df.unique(). My laptop has 16 GB of RAM, and I am running WSL with 14 GB of it allocated. I am using Polars version 0.18.4. df.estimated_size() reports that my dataset is around 6 GB when I first read it in. The schema of my data is:
index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
size = pl.read_parquet("data.parquet").estimated_size()
df = pl.scan_parquet("data.parquet") # use LazyFrames
However, I am unable to perform tasks such as .unique(), .drop_nulls(), and so on without getting SIGKILLed. I am using LazyFrames.
For example,
df = df.drop_nulls().collect(streaming=True)
results in an out-of-memory error. I am able to sidestep this by writing a custom function:
def iterative_drop_nulls(lf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # drop the nulls one column at a time instead of all at once
    for col in subset:
        lf = lf.filter(~pl.col(col).is_null())
    return lf

df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()
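For what it is worth, I would have expected the loop above to behave the same as a single combined filter, as in the sketch below (I have not checked whether the query optimizer treats the two differently), which is part of why the difference surprises me.
# sketch: the same null filter expressed as one combined predicate
mask = pl.col("col1").is_not_null() & pl.col("col2").is_not_null()
df = pl.scan_parquet("data.parquet").filter(mask).collect(streaming=True)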
I am quite curious why the latter works but not the former, given that even the full dataset (as it is initially read in) fits into RAM.
Unfortunately, I am unable to think of a similar trick to do the same thing as .unique(). Is there something I can do to make .unique() take less memory? I have tried:
df = df.lazy().unique(subset=["col1", "col2"]).collect(streaming=True)
and
def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    # deduplicate each slice of n_rows rows on its own, then stitch the slices back together
    parts = []
    for chunk in df.iter_slices(n_rows=n_rows):
        parts.append(chunk.unique(subset=subset))
    return pl.concat(parts)
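Even if this fit in memory, I don't think it is correct: duplicates can span two slices, so the concatenated result would still need one global pass, which is exactly the step I cannot afford. A sketch of what I mean, reusing the unique helper above:
# sketch: per-slice dedup leaves duplicates that straddle slice boundaries,
# so a final global unique() is still needed, and that is the expensive step
deduped = unique(df.collect(), subset=["col1", "col2"])   # duplicates across slices survive
deduped = deduped.unique(subset=["col1", "col2"])         # the global pass I was trying to avoid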
Edit:
I would love a better answer, but for now I am using
df = pl.from_pandas(
    df.collect()
    .to_pandas()
    .drop_duplicates(subset=["col1", "col2"])
)
In general, I have found Polars to be more memory-efficient than Pandas, but maybe this is an area where Polars could improve? Curiously, if I use
df = pl.from_pandas(
    df.collect()
    .to_pandas(use_pyarrow_extension_array=True)
    .drop_duplicates(subset=["col1", "col2"])
)
I get the same memory error, so maybe this is a PyArrow thing.
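One more thing I have been meaning to try is sinking the deduplicated result straight to disk instead of collecting it, in the hope that the streaming engine copes better when nothing has to be materialized. I have not verified that .unique() is supported by the streaming sink, so this is only a sketch:
# sketch: write the deduplicated rows to a new file without collecting them into RAM
# (assumes sink_parquet can handle unique() here, which I have not confirmed)
(
    pl.scan_parquet("data.parquet")
    .unique(subset=["col1", "col2"])
    .sink_parquet("deduped.parquet")
)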