I have a very large dataset on disk as a csv file. I would like to load this into dask, do some cleaning, and then save the data for each value of date into a separate file/folder, as follows:
.
└── test
    ├── 20211201
    │   └── part.0.parquet
    └── 20211202
        └── part.0.parquet
I'm struggling to figure out how to do this efficiently.
I've considered the approach of doing something like:
import dask.dataframe as dd

ddf = dd.read_csv('big_data.csv').map_partitions(clean_data)
ddf.to_parquet('test', partition_on='date')
and I get a directory structure as follows:
.
└── test
    ├── date=2021-12-01T00:00:00
    │   └── part.0.parquet
    └── date=2021-12-02T00:00:00
        └── part.0.parquet
Notably, if I then read the "test/date=2021-12-02T00:00:00" files back, I don't see a field corresponding to date. Additionally, I don't have control over the naming of the files. I could potentially loop back over these folders afterwards, reading them in, renaming them, and writing them back out with the date column added (roughly as in the sketch below), but that seems wasteful. Is that my best option?
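For concreteness, the post-processing loop I have in mind would look roughly like this (the test_renamed output folder and the exact date parsing are just placeholders):

import pathlib
import pandas as pd

for part_dir in sorted(pathlib.Path('test').glob('date=*')):
    date_value = part_dir.name.split('=', 1)[1]      # e.g. '2021-12-01T00:00:00'
    date_folder = date_value[:10].replace('-', '')   # e.g. '20211201'

    df = pd.read_parquet(part_dir)                   # the date column is missing here
    df['date'] = pd.Timestamp(date_value)            # add it back manually

    out_dir = pathlib.Path('test_renamed') / date_folder
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_dir / 'part.0.parquet')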
I've also considered partitioning on the date column and looping over the resulting partitions, writing each one out as I please (see the sketch below), but then I think I would end up recomputing the full pipeline for every partition (unless I persist, and this dataset is too big to hold in memory).
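This is roughly what I mean by that second idea (again with placeholder names, and assuming set_index leaves roughly one date per partition, which I don't think is guaranteed):

import dask.dataframe as dd

ddf = dd.read_csv('big_data.csv', parse_dates=['date']).map_partitions(clean_data)
ddf = ddf.set_index('date')          # shuffle so rows for a given date land in one partition

for i in range(ddf.npartitions):
    start = ddf.divisions[i]         # first date covered by partition i
    part = ddf.get_partition(i)      # still lazy
    # each to_parquet call computes this partition separately, which (as far as
    # I can tell) re-runs the read/clean/shuffle graph it depends on every time
    part.to_parquet(f'test/{start:%Y%m%d}')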
What are best practices for using dask to create a partitioned dataset like this?