
I have a very big parquet file with 25+ million rows. I loaded it using polars and did some transformations.

But I'm getting an out-of-memory error during the transformations, so I'm trying to run them in batches.

Below is sample data and code:

import pandas as pd
import polars as pl

df = pl.read_parquet('path_to_your_file.parquet')

# below is sample data (None instead of np.nan, since a polars
# string column holds nulls rather than NaN)
df = pl.DataFrame(
    {
        "id": ["1", "1", "1", "1", "2", "2", "2"],
        "points": ["v", "b", None, None, "d", None, "c"],
        "rebounds": ["a", None, "z", "m", None, "e", None],
    }
)

grouped_df = df.groupby("id")

def transformations(df):
    # actual transformation logic goes here
    return df

final_df = None

# iterating a GroupBy yields one sub-DataFrame per unique id,
# so this loop runs one batch per id
for chunk in grouped_df:
    df1 = chunk.to_pandas()
    batch_transform = transformations(df1)
    if final_df is None:
        final_df = batch_transform
    else:
        final_df = pd.concat([final_df, batch_transform])

print("Final final_df \n", final_df)

There are 2 million unique values in id. The above code runs one batch per id, so 2 million batches in total.

An id may have 5 to 12 rows of data. I want to decrease the total number of batches by increasing the number of rows per batch. The goal is to improve execution time.

I want to have 10K unique ids in one batch.

How do I split a polars DataFrame after groupby so that each batch has 10K unique ids?
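For example, something like the sketch below is what I have in mind (untested; BATCH_SIZE, id_batch, and final_parts are names I made up, not a polars API). It chunks the unique ids and filters the frame 10K ids at a time:

import pandas as pd
import polars as pl

BATCH_SIZE = 10_000  # 10K unique ids per batch

unique_ids = df["id"].unique()
final_parts = []

# process 10K ids at a time instead of one id per batch
for start in range(0, len(unique_ids), BATCH_SIZE):
    id_batch = unique_ids.slice(start, BATCH_SIZE)
    chunk = df.filter(pl.col("id").is_in(id_batch))
    final_parts.append(transformations(chunk.to_pandas()))

final_df = pd.concat(final_parts)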

Thanks

  • `.read_parquet()` loads everything into memory - [`.scan_parquet()`](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html#polars.scan_parquet) can read the data lazily (and there's `.sink_parquet()` for writing.) – jqurious Aug 21 '23 at 20:28
  • @jqurious I started with scan_parquet but it didn't help. If I collect the lazy frame before the transformation then it works, but I had to configure 12 GB for the process. So I'm trying batches to run with 8 GB of memory. – Hari Aug 21 '23 at 21:19
  • You're asking how to increase the performance of a polars operation and your first step is converting it to pandas. First and foremost you need to convert your `transformations` to polars expressions. That will be orders of magnitude better for your performance than changing your chunking behavior. – Dean MacGregor Aug 22 '23 at 16:13
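
For reference, a minimal sketch of the lazy pipeline the comments above suggest, assuming the transformation logic can be rewritten as polars expressions (the `.agg` below is only a placeholder for the real logic, and `path_to_output.parquet` is a made-up output path):

import polars as pl

# scan_parquet builds a lazy query instead of loading the file eagerly,
# and sink_parquet streams the result to disk without materialising the
# whole frame in memory
(
    pl.scan_parquet("path_to_your_file.parquet")
    .group_by("id")  # .groupby() in polars versions before 0.19
    .agg(
        pl.col("points").drop_nulls(),
        pl.col("rebounds").drop_nulls(),
    )
    .sink_parquet("path_to_output.parquet")
)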

0 Answers