4

I have the following problem. I have a huge CSV file and want to load it with multiprocessing. Pandas needs 19 seconds for an example file with 500,000 rows and 130 columns of mixed dtypes. I tried Dask because I want to parallelize the reading, but it took much longer and I wonder why. I have 32 cores and tried this:

import dask.dataframe as dd
import dask.multiprocessing
dask.config.set(scheduler='processes')
df = dd.read_csv(filepath,
                 sep='\t',
                 blocksize=1000000,
                 )
df = df.compute(scheduler='processes')     # convert to pandas
Varlor
  • As Serge rightly pointed out, if your problem is disk IO, you might try converting the file to a more modern format with compression (HDF5, Feather, or Parquet). Even just zipping the file might help. Of course this adds the overhead of decompression, but if reading from your disk is the bottleneck, it might actually be faster to read less and decompress in memory. – Magellan88 Feb 22 '19 at 12:32
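A rough sketch of the conversion Magellan88 suggests, with placeholder file names and assuming pyarrow or fastparquet is installed for Parquet support:

import pandas as pd

# One-time conversion: parse the CSV once, then store it columnar and compressed.
df = pd.read_csv('huge_file.tsv', sep='\t')
df.to_parquet('huge_file.parquet')

# Later loads pull far fewer bytes from disk than the raw CSV would.
df = pd.read_parquet('huge_file.parquet')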

4 Answers

5

When reading a huge file from disk, the bottleneck is the IO. As Pandas is highly optimized with a C parsing engine, there is very little to gain. Any attempt to use multi-processing or multi-threading is likely to be less performant, because you will spend the same time loading the data from disk and only add overhead for synchronizing the different processes or threads.
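One rough way to check this on your own machine is to compare a raw read of the file against a full parse (the file name is a placeholder):

import time
import pandas as pd

start = time.perf_counter()
with open('huge_file.tsv', 'rb') as f:
    f.read()                                    # raw disk IO, no parsing
io_time = time.perf_counter() - start

start = time.perf_counter()
df = pd.read_csv('huge_file.tsv', sep='\t')     # disk IO plus the C parser
parse_time = time.perf_counter() - start

print(f'raw read: {io_time:.1f}s, read_csv: {parse_time:.1f}s')

If the two numbers are close, the job is IO-bound and parallel parsing cannot help much. Note that the OS will cache the file after the first read, which skews the second measurement, so run each a few times.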

Serge Ballesta
3

Consider what this means:

df = df.compute(scheduler='processes')

  • each process accesses some chunk of the original data. This may be in parallel or, quite likely, limited by the IO of the underlying storage device
  • each process makes a dataframe from its data, which is CPU-heavy and will parallelise well
  • each chunk is serialised by its process and communicated back to the client process from which you called it
  • the client deserialises the chunks and concatenates them for you.

Short story: don't use Dask if your only job is to get a Pandas dataframe in memory; it only adds overhead. Do use Dask if you can operate on the chunks independently and only collect small output in the client (e.g., groupby-aggregate, etc.).
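As a sketch of the second case, a groupby-aggregate keeps the per-chunk parsing inside the workers and only ships a small result back to the client (the column names here are invented for illustration):

import dask.dataframe as dd

ddf = dd.read_csv(filepath, sep='\t', blocksize=25_000_000)
# Only the small aggregated Series crosses the process boundary, not whole chunks.
result = ddf.groupby('some_key')['some_value'].mean().compute(scheduler='processes')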

mdurant
0

You could use multiprocessing, but since the file is not split into pieces, the processes risk waiting on each other for access to it (which is what your measurements suggest).

If you want to use multiprocessing effectively, I recommend splitting the file into several parts and merging all the results in a final step.
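A minimal sketch of that approach, assuming the big file has already been split into part files that each keep a header row (the paths and the split step itself are placeholders):

import glob
from multiprocessing import Pool

import pandas as pd

def read_part(path):
    return pd.read_csv(path, sep='\t')

if __name__ == '__main__':
    parts = sorted(glob.glob('parts/part_*.tsv'))
    with Pool(processes=min(32, len(parts))) as pool:
        frames = pool.map(read_part, parts)       # one part per worker process
    df = pd.concat(frames, ignore_index=True)     # merge everything at the end

Note that each worker still has to pickle its dataframe back to the parent process, which is the same overhead mdurant describes above.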

Frenchy
  • Is it possible to split it internally before reading it? And what would that look like in code? – Varlor Feb 22 '19 at 12:14
  • I don't know how you create your file. The ideal would be to write multiple files while the big file is being created (for example, start a new file every 100000 lines). Afterwards, launch the same number of processes as there are files (see Python multiprocessing). – Frenchy Feb 22 '19 at 12:23
  • This is exactly what Dask does. If you did it yourself by manual multiprocessing, you would not win. – mdurant Feb 22 '19 at 14:49
0

I recommend trying different numbers of processes with the num_workers keyword argument to compute.

Contrary to what is said above, read_csv is definitely compute-bound, and having a few processes working in parallel will likely help.

However, having too many processes all hammering at the disk at the same time might cause a lot of contention and slow things down.

I recommend experimenting a bit with different numbers of processes to see what works best.
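A small sketch of that experiment (the block size and worker counts are arbitrary starting points for a 32-core machine):

import time
import dask.dataframe as dd

ddf = dd.read_csv(filepath, sep='\t', blocksize=25_000_000)
for n in (2, 4, 8, 16, 32):
    start = time.perf_counter()
    ddf.compute(scheduler='processes', num_workers=n)
    print(n, 'workers:', round(time.perf_counter() - start, 1), 's')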

MRocklin