0

I am trying to merge a number of large data sets using Dask in Python to avoid loading issues, and I want to save the merged result as a .csv file. The task is proving harder than I imagined:

I put together a toy example with just two data sets. The code I then use is the following:

import dask.dataframe as dd
import glob
import os

os.chdir('C:/Users/Me/Working directory')
file_list = glob.glob("*.txt")    

# lazily read each semicolon-separated text file into a Dask dataframe
dfs = []
for file in file_list:
    ddf = dd.read_table(file, sep=';')
    dfs.append(ddf)

# concatenate the pieces into a single (lazy) Dask dataframe
dd_all = dd.concat(dfs)

If I use dd_all.to_csv('*.csv') I simply get the two original data sets written out again. If I use dd_all.to_csv('name.csv') I get an error saying the file does not exist (FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Me\\Working directory\\name.csv\\1.part').

Using dd_all.compute() I can check that the merged data set has been created successfully.

MCS
  • 1,071
  • 9
  • 23

1 Answer

0

You are misunderstanding how Dask works - the behaviour you see is expected. In order to write from multiple workers in parallel, each worker must be able to write to a separate file; there is no way to know the length of the first chunk before writing it has finished, for example. Writing to a single file is therefore necessarily a sequential operation.

The default behaviour, therefore, is to write one output file for each input partition, and this is what you see. Since Dask can read these files back in parallel, it does raise the question of why you would want to create a single output file at all.
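For example, with the two-partition dataframe from the question, a call like the following (the output name here is just a placeholder) writes one CSV per partition, with the "*" replaced by the partition number:

# writes one file per partition, e.g. out-0.csv and out-1.csv
dd_all.to_csv('out-*.csv')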

For the second call, without the "*" character, Dask assumes you are supplying a directory, not a file, and tries to write two files inside that directory, which doesn't exist.

If you really wanted to write a single file, you could do one of the following (sketched in the example after the list):

  • use the repartition method to make a single output partition and then call to_csv
  • write the separate files and concatenate them after the fact (taking care of the header line)
  • iterate over the partitions of your dataframe in sequence, appending to the same file.
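As a rough illustration of the first and third options, here is a minimal sketch that assumes the dd_all dataframe from the question; the output names merged-*.csv and merged.csv are just placeholders:

# Option 1: collapse everything into one partition, then write it.
# The "*" is replaced by the partition number, so this produces a
# single file called merged-0.csv.
dd_all.repartition(npartitions=1).to_csv('merged-*.csv', index=False)

# Option 3: walk over the partitions sequentially, appending each one
# to the same open file and writing the header only once.
with open('merged.csv', 'w', newline='') as f:
    for i, part in enumerate(dd_all.to_delayed()):
        part.compute().to_csv(f, header=(i == 0), index=False)

The second variant only ever holds one partition in memory at a time, which tends to be gentler than repartitioning everything into a single piece when the data are large.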
mdurant
  • 27,272
  • 5
  • 45
  • 74
  • I tried modifying my code by adding `dd_all = dd_all.repartition(npartitions=1)` before executing `to_csv`, which was your first suggestion, if I am not wrong. The code is proving extremely slow, though (it has been running since Friday for a number of tables that do not exceed 40 GB). Any idea on how to improve performance? – MCS Nov 19 '18 at 15:09
  • Pandas can require a few times the raw size of your data for intermediates, and I assume you are giving the on-disc size not the in-memory size. You can run with the distributed scheduler (even in-process) to get access to more diagnostics; or choose one of the other methods. But, again, *why* are you doing this? – mdurant Nov 19 '18 at 16:05
  • Unfortunately, the production of a single-file dataset appears to be required for the project. Yes, I fear those are on-disc sizes. Where can I find the distributed scheduler guidelines? – MCS Nov 19 '18 at 16:27
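For reference, a minimal sketch of running with the distributed scheduler in-process, as suggested in the comments; it assumes the same dd_all as above and that the distributed (and, for the dashboard, bokeh) packages are installed:

from dask.distributed import Client

# start a local scheduler in the current process; the dashboard link
# gives task-level progress and memory diagnostics while to_csv runs
client = Client(processes=False)
print(client.dashboard_link)

dd_all.repartition(npartitions=1).to_csv('merged-*.csv', index=False)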