
I am reading a CSV file (10 GB) using Dask. After performing some operations, I export the result to CSV using to_csv. The problem is that exporting this file takes around 27 minutes (according to the ProgressBar diagnostics).

The CSV file has 350 columns: one timestamp column, with every other column's dtype set to float64.

  • Machine Specs:
    • Intel i7-4610M @ 3.00 GHz
    • 8 GB DDR3 RAM
    • 500 GB SSD
    • Windows 10 Pro

I have tried exporting to separate files, e.g. to_csv('filename-*.csv'), and also without including .csv, in which case Dask exports the pieces with a .part extension. But both take about the same time as mentioned above.
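Roughly what those two attempts looked like (paths are placeholders; df is the same dataframe as in the simplified code below):

df.to_csv('filename-*.csv')  # one CSV file per partition
df.to_csv('filename')        # no .csv given; Dask writes numbered .part files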

I think this should not be an I/O issue since I am using an SSD, but I am not sure about that.

Here is my code (simplified):

import dask.dataframe as dd

df = dd.read_csv('path\\to\\csv')
# Doing some operations using df.loc
df.to_csv('export.csv', single_file=True)

I am using Dask v2.6.0.

Expected outcome: complete this process in less time without changing the machine's specs.

Is there any way I can export this file in less time?

Pritesh K.

1 Answer


By default dask dataframe uses the multi-threaded scheduler. This is optimal for most pandas operations, but read_csv partially holds onto the GIL, so you might want to try using the multi-processing or dask.distributed schedulers.

See more information about that here: https://docs.dask.org/en/latest/scheduling.html
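For example, a minimal sketch of switching schedulers (the paths and the df.loc step are placeholders, not the question's actual code):

import dask
import dask.dataframe as dd

df = dd.read_csv('path\\to\\csv')
# ... same df.loc operations as in the question ...

# Run this one computation on the multiprocessing scheduler
with dask.config.set(scheduler='processes'):
    df.to_csv('export-*.csv')

# Or use the distributed scheduler, which also works well on a single machine:
# from dask.distributed import Client
# client = Client()   # starts a local cluster of worker processes
# df.to_csv('export-*.csv')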

If you can, I also recommend using a more efficient file format, like Parquet.

https://docs.dask.org/en/latest/dataframe-best-practices.html#store-data-in-apache-parquet-format
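A minimal sketch of what that could look like (paths are placeholders; pyarrow or fastparquet must be installed for the Parquet engine):

import dask.dataframe as dd

df = dd.read_csv('path\\to\\csv')
# ... same df.loc operations ...

# Write one Parquet file per partition instead of a single large CSV
df.to_parquet('export.parquet', engine='pyarrow')

# Reading it back later avoids re-parsing CSV entirely
df2 = dd.read_parquet('export.parquet', engine='pyarrow')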

MRocklin
Using Parquet is definitely an improvement as it now takes only around 3 minutes. Using the schedulers is the cherry on top. More on `to_parquet`: [Doc for to_parquet](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_parquet) – Pritesh K. Oct 24 '19 at 05:58