I am reading a CSV into a Dask DataFrame and then calling SimpleImputer from the dask_ml library. I am running into two different issues.

Issue 1) SimpleImputer on Dask fails with FileNotFoundError even though I am able to read the columns. Code:

    import dask.dataframe as dd
    from dask_ml.impute import SimpleImputer

    df = dd.read_csv('outlier.csv')
    X = df.drop('Column_A', axis=1)
    print(X.columns)  # This works: it prints all the remaining columns
    p = SimpleImputer().fit_transform(X)  # This line raises the error below

output:

Error
Traceback (most recent call last):
 File "C:\Users\user\Documents\code\blah.py", line 127, in train_blahblah_model
    p = SimpleImputer().fit_transform(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\sklearn\base.py", line 699, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 53, in fit
    self._fit_frame(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 80, in _fit_frame
    self.statistics_ = pd.Series(dask.compute(avg)[0], index=X.columns)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask\base.py", line 561, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 2681, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1990, in gather
    return self.sync(
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 836, in sync
    return sync(
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 324, in f
    result[0] = yield future
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\tornado\gen.py", line 762, in run
    value = future.result()
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1855, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/bytes/core.py", line 185, in read_block_from_file
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
  File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 930, in open
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 117, in _open
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 199, in __init__
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 204, in _open
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/user/Documents/code/outlier.csv'

Issue 2) Read the CSV with pandas and then convert it to a Dask DataFrame. Code:

    import pandas as pd
    import dask.dataframe as dd
    from dask_ml.impute import SimpleImputer

    df = pd.read_csv('outlier.csv', index_col='new')
    df = dd.from_pandas(df, npartitions=3)
    X = df.drop('Column_A', axis=1)
    print(X.columns)  # This works: it prints all the remaining columns
    p = SimpleImputer().fit_transform(X)  # This line raises the error below

Output: error on the line p = SimpleImputer().fit_transform(X)

AttributeError: 'DataFrame' object has no attribute '_data'

Note: all of this works in pandas when I use IterativeImputer to fit and transform. The problem only happens when I try to build the model using Dask, which I need because I eventually want to use Dask workers to build my model.

Seeker

1 Answer

This issue is resolved. The problem was a pandas version mismatch between the client and the worker: the worker was on 1.0.1. I upgraded pandas to 1.2.3 on both machines and the error went away.
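
For anyone who wants to check for this kind of mismatch before upgrading, here is a minimal sketch. The helper name find_version_mismatches is hypothetical, and it assumes the report has the layout that distributed's Client.get_versions() returns (a "client" entry, a "scheduler" entry, and a "workers" mapping, each with a "packages" dictionary). The sample report below is made up to mirror the situation above; it is not real output from my cluster.

```python
from collections import defaultdict


def find_version_mismatches(versions, packages=("pandas", "dask", "distributed")):
    """Return {package: {version: [node, ...]}} for packages whose versions differ.

    `versions` is assumed to look like the dict returned by
    distributed.Client.get_versions().
    """
    seen = {pkg: defaultdict(list) for pkg in packages}

    def record(node, info):
        # Skip nodes that are missing from the report entirely.
        if info is None:
            return
        installed = info.get("packages", {})
        for pkg in packages:
            # A package that is absent on a node is recorded as None.
            seen[pkg][installed.get(pkg)].append(node)

    record("client", versions.get("client"))
    record("scheduler", versions.get("scheduler"))
    for address, info in versions.get("workers", {}).items():
        record(address, info)

    # Keep only packages that appear with more than one distinct version.
    return {pkg: dict(by_version) for pkg, by_version in seen.items() if len(by_version) > 1}


# Hypothetical report: the addresses and the dask version are invented for illustration.
report = {
    "client": {"packages": {"pandas": "1.2.3", "dask": "2021.3.0"}},
    "scheduler": {"packages": {"pandas": "1.0.1", "dask": "2021.3.0"}},
    "workers": {
        "tcp://10.0.0.5:40000": {"packages": {"pandas": "1.0.1", "dask": "2021.3.0"}},
    },
}

print(find_version_mismatches(report))
```

With a live cluster you would pass client.get_versions() instead of the sample report, where client is your distributed.Client. An empty result means the listed packages agree everywhere.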

Please also refer to the question "joblib connection to Dask backend: tornado.iostream.StreamClosedError: Stream is closed" to resolve other possible issues.

Seeker