
In the code below, why does dd.read_csv run on the cluster? I would expect only calls made through the client (e.g. client.read_csv) to run on the cluster.

import dask.dataframe as dd
from dask.distributed import Client

client = Client('10.31.32.34:8786')
df = dd.read_csv('file.csv', blocksize=10e7)
df.compute()

Is it the case that once I create a Client object, all subsequent API calls will run on the cluster?


2 Answers


The command dd.read_csv('file.csv', blocksize=1e8) will generate many pd.read_csv(...) tasks, each of which will run on your Dask workers. Each task will open the file.csv file, seek to a byte offset within that file determined by your blocksize, and read those bytes to create a pandas dataframe. The file.csv file must therefore be accessible from every worker.
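
To make this concrete, here is a minimal sketch (the file name and its size are assumptions for illustration) showing how blocksize controls the number of per-block read tasks:

    import dask.dataframe as dd

    # Assuming file.csv is roughly 1 GB: with blocksize=1e8 (100 MB),
    # dd.read_csv builds a lazy dataframe of about 10 partitions, each
    # backed by one task that reads a single byte range of the file.
    df = dd.read_csv('file.csv', blocksize=1e8)
    print(df.npartitions)   # number of block-read tasks that will run on the workers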

It is common for people to use files that are on some universally available storage, like a network file system, database, or cloud object store.
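
For example, with a hypothetical S3 bucket (and the s3fs package installed on every machine), all workers can read the same object:

    # 'my-bucket' is a made-up bucket name, used only for illustration
    df = dd.read_csv('s3://my-bucket/file.csv', blocksize=1e8)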

MRocklin

In addition to the first answer:

Yes, creating a client connected to a distributed scheduler makes it the default scheduler for all subsequent Dask work. You can, however, specify where you would like work to run, as follows (a combined sketch follows below):

  • for a specific compute,

    dd.compute(scheduler='threads')
    
  • for a block of code,

    with dask.config.set(scheduler='threads'):
        dd.compute()
    
  • until further notice,

    dask.config.set(scheduler='threads')
    dd.compute()
    

See http://dask.pydata.org/en/latest/scheduling.html
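
Putting this together, here is a minimal sketch (the scheduler address and file name are taken from the question and are illustrative only) of running one computation on the cluster and another on the local threaded scheduler, even after the client has been created:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client('10.31.32.34:8786')          # becomes the default scheduler
    df = dd.read_csv('file.csv', blocksize=1e8)  # lazy; nothing runs yet

    row_count = df.map_partitions(len).sum()

    on_cluster = row_count.compute()                         # default: runs on the distributed cluster
    on_my_machine = row_count.compute(scheduler='threads')   # override: runs locally in threads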

mdurant
  • That's correct and I understand that. But assuming I've set up a cluster using dask-ssh, what should I pass to compute(...) so that the computation runs on the distributed cluster, **and** what should I pass so that it runs on my single machine, even though the cluster has been set up? In other words, even after creating a client with `client=Client('10.31.32.34:8786')`, which connects to the cluster, I want to be able to run the computation for some dataframes on the cluster and for others on the single-machine scheduler. How do I do that? – Dhruv Kumar Jun 29 '18 at 06:14
  • 1
    Sorry, I don't understand; doesn't `dd.compute(scheduler=...)` do this for you? – mdurant Jun 29 '18 at 13:56
  • scheduler='threads' would use separate threads within a single process as the workers. Similarly, scheduler='processes' would use separate processes as the workers, and scheduler='synchronous' would execute everything in a single thread in a single process. How, using these, can I choose between the single-machine distributed scheduler and the cluster scheduler? – Dhruv Kumar Jun 30 '18 at 07:14
  • Please update your original question to demonstrate exactly the behaviour that you would like to see, it still isn't clear to me. – mdurant Jun 30 '18 at 13:38