I have a CSV file on the client. To make its data accessible to worker nodes on different machines, I am using client.scatter, as described in Loading local file from client onto dask distributed cluster. This is my code:
import pandas as pd
import time
import dask
import dask.distributed as distributed
import dask.dataframe as dd
from dask import delayed
from dask.distributed import Client, progress
client = Client('scheduler:port')
df = pd.read_csv('file.csv')
[f] = client.scatter([df])  # send dataframe to one worker
ddf = dd.from_delayed([f], meta=df).repartition(npartitions=100).persist()
ddf = ddf.groupby(['col1'])[['col2']].sum()
future = client.compute(ddf)
print(future)
progress(future)
result = client.gather(future)
print(result)
This code runs fine for a CSV file with 1 million records, but for a CSV file with 60 million records it takes far too long. Am I making a mistake somewhere? Any help would be appreciated.
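One idea I had, but have not tested (the chunk count of 100 below is an arbitrary guess): instead of scattering the whole frame to a single worker and repartitioning afterwards, split the DataFrame into pieces on the client and scatter the pieces individually, something like:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client('scheduler:port')
df = pd.read_csv('file.csv')

# split on the client so no single worker has to hold the full DataFrame
chunks = np.array_split(df, 100)
futures = client.scatter(chunks)  # one future per chunk, spread across workers

ddf = dd.from_delayed(futures, meta=df).persist()
result = ddf.groupby(['col1'])[['col2']].sum().compute()
print(result)

Would that avoid a single-worker bottleneck, or is the slow part somewhere else (for example, just shipping 60 million rows from the client over the network)?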