Dask.dataframe will not write to a single CSV file. As you mention, it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
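
For example (a minimal sketch; the input file pattern is hypothetical):

    import dask.dataframe as dd

    df = dd.read_csv('data-*.csv')  # hypothetical input files
    # .compute() pulls every partition into one in-memory Pandas dataframe
    df.compute().to_csv('out.csv', index=False)

This only works if the full result fits comfortably in RAM.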
One option is to avoid Pandas and Dask altogether and just read from the multiple files and dump their contents into another file:
    import glob

    filenames = sorted(glob.glob('out-*.csv'))  # hypothetical pattern for the part files
    out_filename = 'merged.csv'                 # hypothetical output name

    with open(out_filename, 'w') as outfile:
        for in_filename in filenames:
            with open(in_filename, 'r') as infile:
                # if your csv files have headers then you might want to burn a line here with `next(infile)`
                for line in infile:
                    outfile.write(line)  # line already ends with '\n'
If you don't need to do anything except merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to parse the CSV text into in-memory dataframes, which will take a while and which you don't need. If, on the other hand, you need to do some processing with pandas/dask, then I would use dask.dataframe to read and process the data, write to many CSV files, and then use the trick above to merge them afterwards, as sketched below.
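
For example, a sketch of that second workflow (the file pattern and the filter step are hypothetical):

    import dask.dataframe as dd

    df = dd.read_csv('data-*.csv')             # hypothetical input files
    df = df[df['value'] > 0]                   # hypothetical processing step
    df.to_csv('processed-*.csv', index=False)  # writes one CSV per partition

    # then merge the 'processed-*.csv' files with the loop above

Note that each partition's file gets its own header row, so when merging you would want that next(infile) line for every file after the first.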
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster; see http://dask.pydata.org/en/latest/dataframe-create.html
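
For example (hypothetical path; Parquet output requires an engine such as fastparquet or pyarrow):

    df.to_parquet('output.parquet')  # columnar storage, typically much faster than CSV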