
I am trying to split a Parquet file using Dask with the following piece of code:

import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df, output_path)

I have only one physical input file, file.parquet.

The output of this script is also a single file, part.0.parquet.

Based on the partition_size and chunksize parameters, I would expect multiple output files.

Any help would be appreciated.

Serge

1 Answer


df.repartition(partition_size="100MB") returns a new Dask DataFrame; it does not modify df in place.

You have to write:

df = df.repartition(partition_size="100MB")

You can check the number of partitions created by looking at df.npartitions.
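For example (a quick sanity check; the exact count depends on the size of your data):

df = df.repartition(partition_size="100MB")
print(df.npartitions)  # should be greater than 1 if the data exceeds 100MB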

Also, you can use the following to write your Parquet files:

df.to_parquet(output_path)

Because Parquet is designed for large datasets, you should also consider passing the compression= argument when writing your Parquet files.
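Putting it together, a minimal corrected sketch of your script (keeping your dataset_path and output_path variables; compression="snappy" is just one common codec choice):

import dask.dataframe as dd

# read the input file lazily as a Dask DataFrame
df = dd.read_parquet(dataset_path, chunksize="100MB")

# repartition returns a new DataFrame, so reassign it
df = df.repartition(partition_size="100MB")

# each partition is written as a separate part.*.parquet file
df.to_parquet(output_path, compression="snappy")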

You should get what you expect.

NB: Writing import dask.dataframe as pd is misleading, because import dask.dataframe as dd is the common convention.

DavidK