
I have a dataset that fits into RAM, but certain operations, such as df.unique(), cause an out-of-memory error. My laptop has 16 GB of RAM and I am running under WSL, which has 14 GB allocated to it. I am using Polars version 0.18.4. df.estimated_size() reports that the dataset is around 6 GB when I read it in. The schema of my data is:

index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
import polars as pl

size = pl.read_parquet("data.parquet").estimated_size()
df = pl.scan_parquet("data.parquet")  # use LazyFrames

However, I am unable to run operations such as .unique() or .drop_nulls() without the process getting SIGKILLed, even though I am working with LazyFrames.

For example,

df = df.drop_nulls().collect(streaming=True)

results in an out-of-memory error. I am able to sidestep this by writing a custom function:

def iterative_drop_nulls(lf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # Filter out nulls one column at a time instead of all at once.
    for col in subset:
        lf = lf.filter(~pl.col(col).is_null())

    return lf

df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()

I am quite curious why the latter works but not the former, given that the largest version of the dataset (when I read it in initially) fits into RAM.

Unfortunately, I am unable to think of a similar trick to do the same thing as .unique(). Is there something I can do to make .unique() take less memory? I have tried:

df = df.lazy().unique(subset=cols).collect(streaming=True)

and

def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    parts = []
    for chunk in df.iter_slices(n_rows=n_rows):
        # Deduplicate each slice on its own; note that this does not remove
        # duplicates that span two different slices.
        parts.append(chunk.unique(subset=subset))

    return pl.concat(parts)
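
Since the slice-based version cannot catch duplicates that span two slices, a different subdivision might work: partition on a hash of the key columns, so rows with equal keys always land in the same partition, and deduplicate each partition on its own. I have not verified that this actually lowers peak memory, and the function name, column names, and partition count below are just placeholders:

def unique_in_partitions(lf: pl.LazyFrame, subset: list[str], n_partitions: int = 8) -> pl.DataFrame:
    # Hash the key columns so identical keys always fall in the same partition,
    # then deduplicate one partition at a time. Each pass re-reads the parquet
    # file but only holds roughly 1/n_partitions of the rows in memory at once.
    partition = pl.struct(subset).hash() % n_partitions
    parts = [
        lf.filter(partition == i).unique(subset=subset).collect(streaming=True)
        for i in range(n_partitions)
    ]
    return pl.concat(parts)

df = pl.scan_parquet("data.parquet").pipe(unique_in_partitions, ["col1", "col2"])

The obvious downside is that the file gets scanned n_partitions times, so this trades time for memory.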

Edit:

I would love a better answer, but for now I am using

df = pl.from_pandas(
    df.collect()
    .to_pandas()
    .drop_duplicates(subset=["col1", "col2"])
)

In general I have found Polars to be more memory efficient than Pandas, but maybe this is an area Polars could improve? Curiously, if I use

df = pl.from_pandas(
    df.collect()
    .to_pandas(use_pyarrow_extension_array=True)
    .drop_duplicates(subset=["col1", "col2"])
)

I get the same memory error, so maybe this is a PyArrow thing.
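
Another idea I have not fully tested (the column names are placeholders): since the frame itself fits in RAM, keep the unique() hash table small by deduplicating only the key columns plus a row index, and then keep just the surviving rows:

df = pl.scan_parquet("data.parquet").with_row_count("_idx").collect(streaming=True)

# Deduplicate on the key columns plus the row index only, so the hash table
# tracks just those columns instead of the whole row.
keep = (
    df.select(["_idx", "col1", "col2"])
    .unique(subset=["col1", "col2"], keep="first")
    .get_column("_idx")
)

df = df.filter(pl.col("_idx").is_in(keep)).drop("_idx")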

stressed
  • Unique uses a sort which takes some memory. If you're set on using unique, the best approach likely depends on the size of your data. If each element is large, use a hash. If it is small, subdivide the problem and join. https://github.com/numpy/numpy/blob/db4f43983cb938f12c311e1f5b7165e270c393b4/numpy/lib/arraysetops.py#L323 – Carbon Jun 25 '23 at 20:14
  • Have you tried just eagerly reading it? There are different algorithms for streaming and eager. – Dean MacGregor Jun 25 '23 at 22:12
  • I'm confused about the question title, "Reduce polars memory consumption in unique()". Are you asking us to 1. reduce the memory consumption of the function itself? Or 2. how you can do something similar without an out of memory error or 3. how you can still use unique() and get around this? – Mark Jun 25 '23 at 23:25
  • ``` df = df.lazy().unique(cols).collect(streaming=True) ``` Should spill to disk for the `unique` if there is not enough memory. Can it be that the results don't fit into memory? Which polars version do you use? And what is the schema of your data? – ritchie46 Jun 26 '23 at 06:31
  • @Mark I found a workaround but I would love to know why I can't use unique() (or a way for me to use unique()). I also have tried eagerly reading it – stressed Jun 26 '23 at 13:16
  • @ritchie46 The results fit into memory. The original dataset fits into memory, and unique at worst keeps the size the same (also the pandas version I posted in the edit works). I edited my post with the polars version (0.18.4) and schema (Int64, 3 Utf8, 4 Float64). I will try to create a reproducible example when I have a bit more time. – stressed Jun 26 '23 at 13:31
  • 1
    @stressed might be worth opening another question for that! I honestly have no clue sorry! – Mark Jun 26 '23 at 13:32
  • It could be the case that pandas made a different trade-off around memory use versus speed. It could be the case that the Polars one is poorly implemented! – Mark Jun 26 '23 at 13:38

0 Answers