
I am new to Dask.

I have 152 parquet files, 200 MB each on average, on a machine with 32 GB of RAM.

Each file has a timestamp column, and I want to set that column as the partition (index).

If I set the timestamp column as the index directly, there are too many partitions, so I convert it to a date first:

import dask.dataframe as dd

ddf = dd.read_parquet('gs://bucket_name/*.parquet')
# derive a date-level column to use as the partition key
ddf['partition'] = dd.to_datetime(ddf['event_time'], format='%Y/%m/%d')

Other operations, such as groupby, ran successfully.

What is the best practice in this situation if I want to write the parquet files with partitions for fast querying by partition?


1 Answer


This answer is going to be useful. Specifically, you want to set the datetime column as the index and then repartition it to the desired frequency.

# note that specifying npartitions is optional, but
# can be useful if for some reason there are too
# many partitions
ddf = ddf.set_index('partition', npartitions=10)

# you can also repartition it to get the desired frequency
# (e.g. daily)
ddf = ddf.repartition(freq='1D')
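To make queries by partition fast, you can then write the repartitioned frame back to parquet and let Dask recover the index divisions when reading. A minimal sketch, assuming the index was set as above; the output path and the example date are hypothetical:

# write one file per (daily) partition
ddf.to_parquet('gs://bucket_name/partitioned/')

# in recent Dask versions, calculate_divisions=True recovers the index
# divisions from the parquet metadata, so .loc can prune partitions
ddf2 = dd.read_parquet('gs://bucket_name/partitioned/',
                       calculate_divisions=True)

# with known divisions, this touches only the partitions for that day
day = ddf2.loc['2021-01-15']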

Note that you can make this process a lot more efficient if your data is already sorted by datetime; see the details in the answer linked above.
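For example, if the data is already globally sorted by the datetime column, set_index can skip the expensive shuffle. A minimal sketch, assuming sortedness holds across all the input files:

# sorted=True tells Dask the column is already monotonically
# increasing, so no shuffle is needed to build the index
ddf = ddf.set_index('partition', sorted=True)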
