I am reading a CSV into a Dask DataFrame and then calling SimpleImputer from the dask_ml library. I am running into two different issues.
Issue 1) SimpleImputer on Dask fails with FileNotFoundError, even though I am able to read the columns. Code:
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

df = dd.read_csv('outlier.csv')
X = df.drop('Column_A', axis=1)
print(X.columns)  # This works: it prints all the remaining columns
p = SimpleImputer().fit_transform(X)
output:
Error
Traceback (most recent call last):
File "C:\Users\user\Documents\code\blah.py", line 127, in train_blahblah_model
p = SimpleImputer().fit_transform(X)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\sklearn\base.py", line 699, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 53, in fit
self._fit_frame(X)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 80, in _fit_frame
self.statistics_ = pd.Series(dask.compute(avg)[0], index=X.columns)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask\base.py", line 561, in compute
results = schedule(dsk, keys, **kwargs)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 2681, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1990, in gather
return self.sync(
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 836, in sync
return sync(
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 324, in f
result[0] = yield future
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\tornado\gen.py", line 762, in run
value = future.result()
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1855, in _gather
raise exception.with_traceback(traceback)
File "/opt/conda/lib/python3.8/site-packages/dask/bytes/core.py", line 185, in read_block_from_file
File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 930, in open
File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 117, in _open
File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 199, in __init__
File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 204, in _open
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/user/Documents/code/outlier.csv'
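Note that the last frames of the traceback are under /opt/conda/lib/python3.8/..., not under C:\Users\..., so the failing read appears to have run on a worker rather than on the Windows client machine. Below is a minimal sketch for checking whether the workers can see the file. It assumes an already-connected distributed Client is passed in as `client`; the helper names `path_exists` and `check_on_workers` are hypothetical and only for illustration.

```python
import os


def path_exists(path):
    """Return True if `path` exists on the machine where this function runs."""
    return os.path.exists(path)


def check_on_workers(client, path):
    """Run path_exists on every worker of a connected distributed.Client.

    Assumption: `client` is an existing distributed.Client. Client.run
    executes a function on each worker and returns a dict keyed by
    worker address.
    """
    return client.run(path_exists, path)


# Local check on the client side; the same path may not exist on the workers.
print("client sees file:", path_exists("outlier.csv"))
```

If the local check reports True while the workers report False, the path is probably being resolved on machines that do not have the file.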
Issue 2) Reading the CSV with pandas and then converting it to a Dask DataFrame fails with an AttributeError. Code:
import pandas as pd
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

df = pd.read_csv('outlier.csv', index_col='new')
df = dd.from_pandas(df, npartitions=3)
X = df.drop('Column_A', axis=1)
print(X.columns)  # This works: it prints all the remaining columns
p = SimpleImputer().fit_transform(X)
output: error on the line SimpleImputer().fit_transform(X)
AttributeError: 'DataFrame' object has no attribute '_data'
Note: all of this works in pandas when I use IterativeImputer to fit_transform. The problem only occurs when I try to build the model with Dask, which I eventually want to do so that the Dask workers generate the model.