
I have a very large dataset on disk as a CSV file. I would like to load it into Dask, do some cleaning, and then save the data for each value of date into a separate file/folder, as follows:

.
└── test
    ├── 20211201
    │   └── part.0.parquet
    └── 20211202
        └── part.0.parquet

I'm struggling to figure out how to do this efficiently.

I've considered the approach of doing something like:

import dask.dataframe as dd

ddf = dd.read_csv('big_data.csv').map_partitions(clean_data)  # clean_data is my cleaning function
ddf.to_parquet('test', partition_on='date')

and I get a directory structure as follows:

.
└── test
    ├── date=2021-12-01T00:00:00
    │   └── part.0.parquet
    └── date=2021-12-02T00:00:00
        └── part.0.parquet

Notably, if I then try reading the 'test/date=2021-12-02T00:00:00' files on their own, I don't see a field corresponding to date. Additionally, I don't have control over the naming of the files. I could loop back over these directories afterwards, reading them in, renaming them, and writing them back out with the date column added, but that seems wasteful. Is that my best option?

I've also considered partitioning by the date column and looping over the partitions, writing them out as I please, but then I think I would end up recomputing the full pipeline for every partition (I could persist, but this dataset is too big to fit in memory).

What are best practices for using Dask to create a partitioned dataset like this?

Nezo

2 Answers


I'd like to point out that, with the output directory tree you already have, you can access the whole dataset in one go:

dd.read_parquet("test")

and you will see that the date field exists, parsed from the directory names. You can select particular date values in read_parquet or afterwards, and only the files that meet the condition will be loaded. That is the point: you do not need to copy the values of the partitioning field into every data file.
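
For instance, a minimal sketch (the column name date matches the question; depending on the engine version, the value you filter on may need to be a string or a Timestamp):

import dask.dataframe as dd

# read the whole tree back; the date field is rebuilt from the date=... directory names
ddf = dd.read_parquet("test")

# or load only one day's files via predicate pushdown
one_day = dd.read_parquet("test", filters=[("date", "==", "2021-12-01")])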

mdurant

I figured it out. map_partitions takes a function, and that function can accept an optional partition_info argument (which contains the partition key). I wrote a function that saves each partition under the desired name (resetting the index so the key field appears as a column), and applied it with map_partitions.
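
Roughly, a sketch of the idea (assuming the index has been set to date and each partition covers a single date; the save_partition helper and the output naming below are just illustrative):

import os
import dask.dataframe as dd

def save_partition(df, partition_info=None):
    # dask fills in partition_info with the partition 'number' and its
    # 'division' - the first index value, i.e. the date once date is the index
    if partition_info is not None and len(df):
        date = partition_info["division"]
        out_dir = os.path.join("test", str(date)[:10].replace("-", ""))
        os.makedirs(out_dir, exist_ok=True)
        # reset the index so the date shows up as an ordinary column in the file
        df.reset_index().to_parquet(os.path.join(out_dir, "part.0.parquet"))
    # return something tiny so .compute() doesn't pull the full data back into memory
    return df.head(0)

ddf = dd.read_csv("big_data.csv").map_partitions(clean_data)
ddf = ddf.set_index("date")  # ideally repartitioned so divisions fall on date boundaries
ddf.map_partitions(save_partition).compute()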

Nezo