
I am currently trying Dask locally (parallel processing) for the first time on a large dataset (3.2 GB). I am comparing Dask's speed with pandas on simple computations. Using Dask seems to result in slower execution times for every task besides reading and transforming the data.

Example:

# pandas code
import numpy as np
import pandas as pd
import time

T = pd.read_csv("transactions_train.csv")

Reading the data is slow: it takes about 1.5 minutes.

Then I tried a simple computation:

%%time
T.price.mean()

This executes in about 3 seconds.

As for Dask:

from dask.distributed import Client, progress, LocalCluster

# start a local cluster using the machine's cores
client = Client()
client

import dask.dataframe as dd

DT = dd.read_csv("transactions_train.csv")

This executed in 0.551 seconds.

%%time
DT.price.mean().compute()

This takes 25 seconds to run.

It gets worse for heavy computations like modelling.

Any help would be appreciated, as I am new to Dask and not sure whether I am using it correctly.

My PC has 4 cores.

  • This isn't really an apples-to-apples comparison. When you time pandas computing the mean, you're only measuring how long it takes to compute the mean. However, when you time Dask computing the mean, you're including the read step along with the computation step. So if pandas takes 1.5 minutes to read the data, the total times are actually Pandas = 93 sec vs Dask = 26 sec – Paul H Apr 12 '22 at 19:36
  • @PaulH yeah, but I will do multiple computations and by then the time I gained from the read function will have been lost. My English is not great, but I hope you get my point. – Not_So_Solid_Snake Apr 12 '22 at 19:54
  • you can do `res1 = fn1(df); res2 = fn2(df); res3 = fn3(df); res1, res2, res3 = dask.compute([res1, res2, res3], sync=True)`. also, check out the ["persist intelligently"](https://docs.dask.org/en/stable/dataframe-best-practices.html#persist-intelligently) section of the best practices guide. generally, read the whole best practices guide. – Michael Delgado Apr 12 '22 at 20:13
  • If you want something fast, CSV is really bad. It is really not designed for GB-scale datasets. You need to use at least a binary format. Using parallelism to decode CSV files is like taking a plane to go 100 m away and then trying to optimize the travel... By the way, the same applies to the mean. This is a memory-bound operation. 3 s for 1 column of a 3 GB dataset is slow... 1 thread should be able to be at least three times faster (on any basic PC -- >6x on mine). – Jérôme Richard Apr 12 '22 at 22:20
  • @Not_So_Solid_Snake that's not correct. If you set up your Dask workflow properly, you'll only need to read the CSV once. Remember, computing the mean didn't take 25 sec; reading the data and computing the mean took 25 sec. – Paul H Apr 13 '22 at 03:37
  • @MichaelDelgado That did it. Now the CSV is only read one time. – Not_So_Solid_Snake Apr 13 '22 at 12:15
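
To make the advice in the comments concrete, here is a minimal sketch combining the two suggestions: convert the CSV to a binary format (Parquet) once, and persist the parsed data in worker memory so repeated computations don't re-read the file. This is not from the original posts: the Parquet path is made up, the price column is taken from the question, writing Parquet requires pyarrow or fastparquet to be installed, and persist() assumes the dataset fits in RAM.

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster using the machine's cores

# One-off conversion: Parquet is a binary, columnar format that is much
# cheaper to parse than CSV (output path is illustrative).
dd.read_csv("transactions_train.csv").to_parquet("transactions_train.parquet")

# Read lazily, then persist the partitions in distributed memory so that
# later computations reuse them instead of re-reading the file.
DT = dd.read_parquet("transactions_train.parquet").persist()

mean_price = DT.price.mean().compute()  # no longer pays the read cost
row_count = len(DT)                     # neither does this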

1 Answer


Avoid calling compute repeatedly. For simple operations like these, batch them into a single call, for example:

import dask
xmin, xmax = dask.compute(df.x.min(), df.x.max())
– Sebastian
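
To relate this answer to the question: df here is the Dask dataframe (DT in the question), and the same pattern lets several aggregations share one pass over the CSV instead of re-reading it per .compute() call. A minimal sketch, with illustrative aggregations on the price column from the question:

import dask
import dask.dataframe as dd

DT = dd.read_csv("transactions_train.csv")

# Build the lazy results first, then evaluate them together; the CSV is
# read and parsed only once for all three aggregations.
price_min, price_max, price_mean = dask.compute(
    DT.price.min(), DT.price.max(), DT.price.mean()
)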