
I have a sample data set on my local machine and I'm trying to do some basic operations on a cluster.

    import dask.dataframe as ddf
    from dask.distributed import Client

    client = Client('Ip address of the scheduler')
    csvdata = ddf.read_csv('Path to the CSV file')

The client is connected to a scheduler, which in turn is connected to two workers (on other machines).

My questions may be pretty trivial.

  1. Should this CSV file be present on the other worker nodes?

    I seem to get file not found errors.

  2. Using,

    futures = client.scatter(csvdata)
    x = ddf.from_delayed([future], meta=df)
    # Price is a column in the data
    df.Price.sum().compute(get=client.get)  # returns "dd.Scalar<series-..., dtype=float64>". How do I access it?
    client.submit(sum, x.Price)             # returns "distributed.utils - ERROR - 6dc5a9f58c30954f77913aa43c792cc8"
    

Also, I did refer to "Loading local file from client onto dask distributed cluster" and http://distributed.readthedocs.io/en/latest/manage-computation.html

I think I'm mixing up a lot of things here and my understanding is muddled. Any help would be really appreciated.

Linda

1 Answer


Yes, here dask.dataframe is assuming that the files you refer to in your client code are also accessible by your workers. If this is not the case then you will have to read in your data explicitly on your local machine and scatter it out to your workers.
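
As a quick sanity check (a minimal sketch, not from the original answer, assuming the same path string and using os.path.exists through client.run), you can ask every worker whether it can actually see that file; any False here explains the file-not-found errors:

    import os

    # Runs os.path.exists on every worker and returns {worker_address: bool}
    print(client.run(os.path.exists, 'Path to the CSV file'))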

It looks like you're trying to do exactly this, except that you're scattering dask dataframes rather than pandas dataframes. You will actually have to concretely load the pandas data from disk before you scatter it. If your data fits in memory then you should be able to do exactly what you're doing now, but replace the dd.read_csv call with pd.read_csv:

    import pandas as pd

    csvdata = pd.read_csv('Path to the CSV file')   # concrete pandas dataframe in local memory
    [future] = client.scatter([csvdata])            # ship it to the cluster
    x = ddf.from_delayed([future], meta=csvdata).repartition(npartitions=10).persist()
    # Price is a column in the data
    x.Price.sum().compute(get=client.get)  # should return a plain number
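
To address the "how do I access it?" part of the question: .compute() hands back a concrete Python value, so you can simply assign it to a variable (a small sketch reusing the x and client from above):

    total = x.Price.sum().compute(get=client.get)
    print(total)  # an ordinary Python float, usable like any other number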

If your data is too large then you might consider using dask locally to read and scatter data to your cluster piece by piece.

    import dask
    import dask.dataframe as dd

    ddf = dd.read_csv('filename')
    # Read each partition locally (single-threaded local scheduler) and scatter it to the cluster
    futures = ddf.map_partitions(lambda part: client.scatter([part])[0]).compute(get=dask.get)

    ddf = dd.from_delayed(list(futures), meta=ddf.meta)
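
From here the cluster-backed dataframe behaves like before; for example (a small sketch assuming the same client and a Price column):

    total = ddf.Price.sum().compute(get=client.get)
    print(total)
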
MRocklin