I am new to Dask. I have 152 Parquet files, averaging about 200 MB each, on a machine with 32 GB of RAM.
Each file has a Timestamp column, and I want to use that column as the partitioning key (the index). If I set the Timestamp column as the index directly, there are too many partitions, so I need to truncate it to a date first:
import dask.dataframe as dd

ddf = dd.read_parquet('gs://bucket_name/*.parquet')
# Truncate the timestamp to midnight so all rows from one calendar day share a value
ddf['partition'] = dd.to_datetime(ddf['event_time']).dt.floor('D')
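If I understand the docs correctly, setting this truncated column as the index and then repartitioning by day should give one Dask partition per calendar day. This is just a sketch continuing from the snippet above, so I may be misusing it:

# Make the day-truncated column the index, then aim for one partition per day
ddf = ddf.set_index('partition')    # expensive: triggers a full shuffle/sort
ddf = ddf.repartition(freq='1D')    # coarsen divisions to one partition per day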
Other operations, like groupby, run successfully on this DataFrame.
What is the best practice for handling this situation if I want to write the data back out as Parquet with partitions, so that queries filtering by partition are fast? (My current idea is sketched below.)
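In case it helps, this is the direction I am leaning: a hive-style write with partition_on, then filtering on read. The output path by_day/ and the example date are placeholders I made up, and I derive the partition value as a date string here so the directory names and the read-back filter line up; I am not sure this is the right approach:

import dask.dataframe as dd

ddf = dd.read_parquet('gs://bucket_name/*.parquet')
# Day as a plain string so the hive directory names are clean (my assumption)
ddf['partition'] = dd.to_datetime(ddf['event_time']).dt.strftime('%Y-%m-%d')

# Write one sub-directory per distinct day: .../by_day/partition=2021-01-01/...
ddf.to_parquet('gs://bucket_name/by_day/', partition_on=['partition'])

# Later, read back a single day; the filter should prune whole directories
one_day = dd.read_parquet(
    'gs://bucket_name/by_day/',
    filters=[('partition', '==', '2021-01-01')],
)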