
I am trying to split a Parquet file using Dask with the following piece of code:

import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df, output_path)

I have only one physical input file, file.parquet.

The output of this script is also a single file, part.0.parquet.

Based on the partition_size and chunksize parameters, I would expect multiple output files.

Any help would be appreciated.

Serge

1 Answer


df.repartition(partition_size="100MB") returns a new Dask DataFrame; it does not modify df in place.

You have to write:

df = df.repartition(partition_size="100MB")

You can check the number of partitions created by looking at df.npartitions.
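For example (a quick sanity check; the exact count depends on the size of your data):

df = df.repartition(partition_size="100MB")
print(df.npartitions)  # should be greater than 1 if the data exceeds 100MB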

Also, you can use the following to write your Parquet files:

df.to_parquet(output_path)

Because Parquet is designed for large datasets, you should also consider passing the compression= argument when writing your Parquet files.
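Putting it together, a minimal corrected sketch of your script (keeping your dataset_path and output_path variables; compression="snappy" is just one common codec choice):

import dask.dataframe as dd

# read the input file lazily as a Dask DataFrame
df = dd.read_parquet(dataset_path, chunksize="100MB")

# repartition returns a new DataFrame, so reassign it
df = df.repartition(partition_size="100MB")

# each partition is written as a separate part.*.parquet file
df.to_parquet(output_path, compression="snappy")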

You should get what you expect.

NB: Writing import dask.dataframe as pd is misleading, because import dask.dataframe as dd is the common convention.

DavidK