
I am trying a simple parallel computation in Dask. This is my code:

  import dask.dataframe as dd
  from dask.distributed import Client

  client = Client('localhost:8786')
  df = dd.read_csv('file.csv')
  ddf = df.groupby(['col1'])[['col2']].sum()
  ddf = ddf.compute()
  print(ddf)

It seems fine from the documentation, but when I run it I get this:

    Traceback (most recent call last):
      File "dask_prg1.py", line 17, in <module>
        ddf = ddf.compute()
      File "/usr/local/lib/python2.7/site-packages/dask/base.py", line 156, in compute
        (result,) = compute(self, traverse=False, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/dask/base.py", line 402, in compute
        results = schedule(dsk, keys, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 2159, in get
        direct=direct)
      File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1562, in gather
        asynchronous=asynchronous)
      File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 652, in sync
        return sync(self.loop, func, *args, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 275, in sync
        six.reraise(*error[0])
      File "/usr/local/lib/python2.7/site-packages/distributed/utils.py", line 260, in f
        result[0] = yield make_coro()
      File "/usr/local/lib/python2.7/site-packages/tornado/gen.py", line 1099, in run
        value = future.result()
      File "/usr/local/lib/python2.7/site-packages/tornado/concurrent.py", line 260, in result
        raise_exc_info(self._exc_info)
      File "/usr/local/lib/python2.7/site-packages/tornado/gen.py", line 1107, in run
        yielded = self.gen.throw(*exc_info)
      File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 1439, in _gather
        traceback)
      File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 122, in read_block_from_file
        with lazy_file as f:
      File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 166, in __enter__
        f = SeekableFile(self.fs.open(self.path, mode=mode))
      File "/usr/local/lib/python2.7/site-packages/dask/bytes/local.py", line 58, in open
        return open(self._normalize_path(path), mode=mode)
    IOError: [Errno 2] No such file or directory: 'file.csv'

I do not understand what is wrong. Kindly help me with this. Thank you in advance.

Sweta

1 Answer


You may wish to pass the absolute file path to read_csv. The reason is that you are handing the work of opening and reading the file to a dask worker, and that worker may not have been started with the same working directory as your script/session.

mdurant
  • That's not the issue. Tried it. I also tried omitting the compute() statement and then running. It runs fine. So, I think the issue is with the compute() statement. – Sweta Jul 19 '18 at 13:27
  • Are your workers on the same machine and do they have permission to see the same files? – mdurant Jul 19 '18 at 13:38
  • 1
    Thank you. That's the issue . Workers are on different machines and probably one of them is not able to access the csv file. – Sweta Jul 19 '18 at 13:54
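The diagnosis reached in the comments can also be confirmed programmatically: Client.run executes a function on every worker and returns the results per worker, so you can ask each one whether it can see the file. A minimal sketch, using a throwaway in-process LocalCluster in place of the real 'localhost:8786' scheduler:

```python
import os
from dask.distributed import Client, LocalCluster

# In-process cluster standing in for the user's multi-machine setup.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

path = os.path.abspath('file.csv')  # the path every worker should see

# Client.run returns {worker_address: result}; any worker that cannot
# see the file reports False here.
visible = client.run(os.path.exists, path)
print(visible)

client.close()
cluster.close()
```

On a real multi-machine cluster, a False entry pinpoints exactly which worker is missing the file.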