I'm trying to read in a 220 GB csv file with dask. Each line of this file has a name, a unique id, and the id of its parent. Each entry has multiple generations of parents, and eventually I'd like to reassemble the whole tree, but right now it takes 15 minutes just to find the parent of one entry (compared to 4 minutes in PySpark with roughly the same node configuration). I'm running across four nodes, with the scheduler on one node and 12 workers spread across the other three nodes. Here's the code I'm using:

#!/usr/bin/env python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("hostnode:8786")

def find_parent(df, fid):
    # Filter down to the row whose file id matches, then pull it back locally.
    print("searching for", fid)
    newdf = df[df["fid"] == fid][["pid", "name", "fid"]].compute()
    print(newdf.head())
    print("found it", newdf["name"].tolist()[0])

biggie = dd.read_csv("input_file.csv", delimiter=",",
                     names=["fid", "pkn", "pid", "name", "updated"],
                     escapechar="\\")
print(biggie.head())
find_parent(biggie, "0x240006b93:0x10df5:0x0")

Tips on how to speed this up would be greatly appreciated.

  • Upvote for question. Simple tasks like `len(df.index)` or `df.head(10)` / reading in the first nrows take an extraordinary amount of time with `dask.dataframe` (almost instantaneous with `pandas`). Sometimes it's not clear what's happening under the hood. – jpp Jan 26 '18 at 14:07

1 Answer


First, I would find out what is actually taking the most time. Take a look at the Profile tab in the diagnostic dashboard. See http://distributed.readthedocs.io/en/latest/diagnosing-performance.html#statistical-profiling
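As a rough illustration, something like the following sketch pulls that profiling data programmatically rather than through the browser (the scheduler address is copied from the question; the dashboard port and the exact Client.profile behaviour depend on your distributed version):

from dask.distributed import Client

# Connect to the running scheduler (address taken from the question).
client = Client("hostnode:8786")

# The Bokeh dashboard typically runs on port 8787 of the scheduler node; its
# "Profile" tab shows the statistical profiler from the linked docs. The same
# sampled call-stack data can also be fetched from the workers directly:
profile_data = client.profile()   # aggregated profile across workers
print(profile_data)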

I suspect that you're spending all of your time parsing the CSV file. You might want to use the usecols= parameter to reduce the number of columns that you need to parse; see the pandas.read_csv docs for details. You might also consider using more processes with fewer threads: pandas.read_csv does not release the GIL when parsing text columns, so extra threads within a single process don't help there.
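A sketch of both suggestions (the column names come from the question; the dask-worker flags are an assumption and have changed names across distributed releases):

import dask.dataframe as dd

# Only parse the columns the lookup actually touches; usecols= is forwarded
# to pandas.read_csv and skips parsing of the remaining text columns.
biggie = dd.read_csv(
    "input_file.csv",
    names=["fid", "pkn", "pid", "name", "updated"],
    usecols=["fid", "pid", "name"],
    escapechar="\\",
)

# On each worker node, favour processes over threads so text parsing is not
# serialized by the GIL, e.g. (flag names vary between distributed releases):
#   dask-worker hostnode:8786 --nprocs 4 --nthreads 1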

MRocklin
  • Hi, thanks for the tips. Running more processes and fewer threads did help speed things up. Looking at the bokeh output as this process is running, I see it is re-reading the file each time the find_parent function is called (I see it calling pandas' read_csv over and over). Is there some way to cache this dataframe in distributed memory after the first read? I'm looking through the docs and it doesn't seem like it, but please correct me if I'm wrong. – user9270849 Jan 29 '18 at 19:28
  • You probably want `persist`. See http://distributed.readthedocs.io/en/latest/memory.html – MRocklin Jan 30 '18 at 13:31
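A minimal sketch of that `persist` suggestion, reusing the names and lookup value from the question (whether it helps depends on the workers having enough combined memory to hold the parsed columns):

import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client("hostnode:8786")

biggie = dd.read_csv(
    "input_file.csv",
    names=["fid", "pkn", "pid", "name", "updated"],
    usecols=["fid", "pid", "name"],
    escapechar="\\",
)

# persist() parses the CSV once and keeps the partitions in distributed memory,
# so repeated lookups reuse the cached partitions instead of re-reading the file.
biggie = biggie.persist()
wait(biggie)

# Subsequent queries now hit memory, not the 220 GB file on disk.
parent = biggie[biggie["fid"] == "0x240006b93:0x10df5:0x0"][["pid", "name", "fid"]].compute()
print(parent.head())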