
I am new to Dask.

I have 152 parquet files, 200 MB each on average, on a machine with 32 GB of RAM.

Each file has a timestamp column, and I want to set that column as the partition (index).

If I set the timestamp column as the index directly, there are too many partitions, so I convert it to a date first:

import dask.dataframe as dd

ddf = dd.read_parquet('gs://bucket_name/*.parquet')
# derive a date-level column to use as the partition key
ddf['partition'] = dd.to_datetime(ddf['event_time'], format='%Y/%m/%d')

Other operations, such as groupby, ran successfully.

What is the best practice in this situation if I want to write the parquet files with partitions for fast querying by partition?


1 Answer


This answer is going to be useful. Specifically, you want to set the datetime column as the index and then repartition it to the desired frequency.

# note that specifying npartitions is optional, but
# can be useful if for some reason there are too
# many partitions
ddf = ddf.set_index('partition', npartitions=10)

# you can also repartition it to get the desired frequency
# (e.g. daily)
ddf = ddf.repartition(freq='1D')
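To make queries by partition fast, you can then write the repartitioned frame back to parquet and let Dask recover the index divisions when reading. A minimal sketch, assuming the index was set as above; the output path and the example date are hypothetical:

# write one file per (daily) partition
ddf.to_parquet('gs://bucket_name/partitioned/')

# in recent Dask versions, calculate_divisions=True recovers the index
# divisions from the parquet metadata, so .loc can prune partitions
ddf2 = dd.read_parquet('gs://bucket_name/partitioned/',
                       calculate_divisions=True)

# with known divisions, this touches only the partitions for that day
day = ddf2.loc['2021-01-15']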

Note that you can make this process a lot more efficient if your data is already sorted by datetime; see the details in the answer linked above.
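For example, if the data is already globally sorted by the datetime column, set_index can skip the expensive shuffle. A minimal sketch, assuming sortedness holds across all the input files:

# sorted=True tells Dask the column is already monotonically
# increasing, so no shuffle is needed to build the index
ddf = ddf.set_index('partition', sorted=True)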
