
I have a function that reads large CSV files using a Dask dataframe and then converts to a Pandas dataframe, which takes quite a lot of time. The code is:

import os
from glob import glob

import dask.dataframe as dd


def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest files
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])

latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]


Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MI3 = P1MI3.compute()

P1MJC_main = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]
P1MJC_old = P1MJC_main.compute()

P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?

K.S

1 Answer


I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone. Consider:

  • file access may be from several threads, but you only have one disc interface, which is a bottleneck and will likely perform much better reading sequentially than trying to read several files in parallel
  • reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
  • when you compute, you materialise the whole dataframe. It is true that you appear to be selecting only a small subset of rows in each case, but Dask has no way to know in which file/part they are.
  • you call compute twice, but could have combined them: Dask works hard to evict data from memory that is not currently needed by any computation, so you end up doing double the work. By calling compute once on both outputs, you would roughly halve the time.

Further remarks:

  • obviously you would do much better if you knew which partition contained what
  • you can get around the GIL by using processes, e.g., Dask's distributed scheduler
  • if you only need certain columns, do not bother to load everything and then subselect; include those columns right in the read_csv call (the usecols= keyword), saving a lot of time and memory (true for Pandas or Dask). A rough sketch combining these two points follows this list.
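
A minimal sketch of what that could look like, assuming a local machine with a few spare cores; the Client settings are illustrative only, and the separator, encoding, file list and column names are simply copied from the question:

import dask.dataframe as dd
from dask.distributed import Client

# A pool of worker processes sidesteps the GIL for the CPU-bound CSV parsing.
# n_workers=4 is an assumption; tune it to your machine and memory.
client = Client(processes=True, n_workers=4)

# Read only the columns that are actually needed (usecols is passed through
# to pandas.read_csv) instead of loading everything and subselecting later.
Tea2Array = dd.read_csv(
    latest_Tea2Array,
    sep=chr(1),
    encoding="utf-16",
    usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'],
)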

To compute both lazy things at once:

P1MI3, P1MJC_old = dask.compute(P1MI3, P1MJC_main)
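
This needs import dask, and it returns both results as Pandas dataframes from a single shared pass over the data, rather than reading and filtering the files twice.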
mdurant
  • Hi, thanks for giving such an elaborate explanation. It made things clear for me as I'm using Dask for the first time. You mentioned combining the compute. Can you please let me know how that can be done? – K.S Sep 19 '19 at 18:08