
I have a CSV file on the client. To make its data accessible to worker nodes on different machines, I am using client.scatter, following the approach from Loading local file from client onto dask distributed cluster. This is my code:

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client, progress

    client = Client('scheduler:port')

    # read the csv locally, then scatter the whole frame to the cluster
    df = pd.read_csv('file.csv')
    [f] = client.scatter([df])  # send dataframe to one worker

    # build a dask dataframe from the scattered future and split it into partitions
    ddf = dd.from_delayed([f], meta=df).repartition(npartitions=100).persist()
    ddf = ddf.groupby(['col1'])[['col2']].sum()

    future = client.compute(ddf)
    print(future)
    progress(future)
    result = client.gather(future)
    print(result)

This code runs fine for a CSV file with 1 million records, but for a CSV file with 60 million records it takes too long. Am I making any mistake? Any help would be appreciated.
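
A possible variant (a minimal sketch, assuming the same scheduler address and input file; the chunk count of 100 is an arbitrary choice) is to split the DataFrame on the client and scatter the pieces, so the data is spread across the workers rather than sent to a single one:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client('scheduler:port')
    df = pd.read_csv('file.csv')

    # split the local frame into roughly equal pieces and scatter each piece;
    # the scheduler spreads the resulting futures over the available workers
    chunks = np.array_split(df, 100)
    futures = client.scatter(chunks)

    # assemble a dask dataframe from the scattered futures;
    # an empty slice of df serves as the metadata
    ddf = dd.from_delayed(futures, meta=df.head(0))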
