I'm trying to read in a 220 GB CSV file with dask. Each line of the file has a name, a unique id, and the id of its parent. Each entry has multiple generations of parents, and eventually I'd like to reassemble the whole tree, but for now even finding the parent of a single entry takes 15 minutes (compared to about 4 minutes in PySpark with roughly the same node configuration). I'm running across four nodes, with the scheduler on one node and 12 workers spread across the other three. Here's the code I'm using:
#!/usr/bin/env python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("hostnode:8786")

def find_parent(df, fid):
    # Filter down to the matching row, then pull the result back to the client.
    print("searching for", fid)
    newdf = df[df["fid"] == fid][["pid", "name", "fid"]].compute()
    print(newdf.head())
    print("found it", newdf["name"].tolist()[0])

biggie = dd.read_csv(
    "input_file.csv",
    sep=",",
    names=["fid", "pkn", "pid", "name", "updated"],
    escapechar="\\",
)
print(biggie.head())

find_parent(biggie, "0x240006b93:0x10df5:0x0")
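One thing I've been wondering about, but haven't verified, is whether persisting the dataframe in distributed memory and setting fid as the index would help, since as written every call to find_parent rescans the full 220 GB CSV. A minimal sketch of what I mean (same hostname and column names as above; I'm not sure this is the idiomatic dask pattern):

import dask.dataframe as dd
from dask.distributed import Client

client = Client("hostnode:8786")

biggie = dd.read_csv(
    "input_file.csv",
    sep=",",
    names=["fid", "pkn", "pid", "name", "updated"],
    escapechar="\\",
)

# set_index sorts the data by fid, so a .loc lookup should only touch
# the relevant partition instead of scanning all of them; persist keeps
# the shuffled partitions in worker memory so repeated lookups don't
# re-read the CSV. The shuffle is expensive up front, but it might pay
# off once I'm walking many generations of the tree.
indexed = biggie.set_index("fid").persist()

# A single lookup by id against the indexed, in-memory dataframe:
row = indexed.loc["0x240006b93:0x10df5:0x0"].compute()
print(row[["pid", "name"]])

I don't know whether the one-time cost of the shuffle is worth it here, or whether there's a better approach entirely.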
Tips on how to speed this up would be greatly appreciated.