I have a very big parquet file with 25+ million rows. I loaded it with Polars and applied some transformations, but I get an out-of-memory error during the transformations, so I am trying to run them in batches.
Below is sample data and code:
import polars as pl
import pandas as pd

df = pl.read_parquet('path_to_your_file.parquet')

# below is sample data (None marks missing values)
df = pl.DataFrame(
    {
        "id": ["1", "1", "1", "1", "2", "2", "2"],
        "points": ["v", "b", None, None, "d", None, "c"],
        "rebounds": ["a", None, "z", "m", None, "e", None],
    }
)

grouped_df = df.group_by("id")

def transformations(df):
    # actual transformation logic goes here
    return df

final_df = None
# each iteration yields one id's key and its rows as a DataFrame
for _key, chunk in grouped_df:
    df1 = chunk.to_pandas()
    batch_transform = transformations(df1)
    if final_df is None:
        final_df = batch_transform
    else:
        final_df = pd.concat([final_df, batch_transform])

print("Final final_df \n", final_df)
There are 2 million unique values in id. The code above runs one batch per id, so 2 million batches in total, and each id only has 5 to 12 rows. I want to decrease the total number of batches by increasing the number of rows per batch; the goal is to improve execution time. I would like each batch to contain 10K unique ids.

How can I split a Polars DataFrame after group_by so that each batch has 10K unique ids?
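To make the goal concrete, below is a rough sketch of the kind of batching I have in mind: split the list of unique ids into chunks of 10,000 and filter the DataFrame on each chunk (the BATCH_SIZE name is just illustrative). I am not sure this is the right or most efficient way to do it in Polars, which is exactly what I am asking:

BATCH_SIZE = 10_000  # desired number of unique ids per batch

# collect the unique ids and split them into chunks of 10K
unique_ids = df.get_column("id").unique().to_list()
id_chunks = [unique_ids[i:i + BATCH_SIZE] for i in range(0, len(unique_ids), BATCH_SIZE)]

results = []
for ids in id_chunks:
    # one batch = all rows whose id falls in this chunk of 10K ids
    batch = df.filter(pl.col("id").is_in(ids))
    results.append(transformations(batch.to_pandas()))

final_df = pd.concat(results)

With 2 million unique ids this would be roughly 200 batches instead of 2 million.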
Thanks