
I have been trying to write custom classes for preprocessing, feature selection, and machine learning algorithms.

I got this working (preprocessing only) using @delayed. The tutorials say the same can be achieved with Client, but switching to it caused two problems.

I am running this as a script, not in a Jupyter notebook.

First Problem:

# Haven't run any scheduler or worker manually
client = Client() # Nothing passed as an argument
# Local Cluster is not working;
Error:... 
       if __name__=='__main__': 
            freeze_support()
      ...

I tried the same in a Jupyter notebook, without starting a scheduler or workers in separate terminals, and it worked.

Next, I opened three terminals, started one scheduler and two workers, and changed the script to use Client('IP'). The error went away. Is there a reason for this behavior?

Second Problem:

This is the error mentioned in the title of the question. I passed client = Client('IP') as an argument to the constructor and used self.client.submit to send work to the cluster, but it failed with this error message:

Error: No module named 'diya_info'

Here's the code:

main.py

import dask.dataframe as dd
from diya_info import Diya_Info
import time
# from dask import delayed
from dask.distributed import Client

df = dd.read_csv(
    '/Users/asifali/workspace/playground/flask/yellow_tripdata_2015-01.csv')

# df = delayed(df.fillna(0.3))
# df = df.compute()

client = Client('192.168.0.129:8786')

X = df.drop('payment_type', axis=1).copy()
y = df['payment_type']


Instance = Diya_Info(X, y, client)
s = time.ctime(int(time.time()))
print(s)


Instance = Instance.fit(X, y)


e = time.ctime(int(time.time()))
print(e)
# print((e-s) % 60, ' secs')

diya_info.py

from sklearn.base import TransformerMixin, BaseEstimator
from dask.multiprocessing import get
from dask import delayed, compute


class Diya_Info(BaseEstimator, TransformerMixin):
    def __init__(self, X, y, client):
        assert X is not None, 'X can\'t be None'
        assert type(X).__name__ == 'DataFrame', 'X not of type DataFrame'
        assert y is not None, 'y can\'t be None'
        assert type(y).__name__ == 'Series', 'y not of type Series'

        self.client = client

    def fit(self, X, y):
        self.X = X
        self.y = y
        # X_status = self.has_null(self.X)
        # y_status = self.has_null(self.y)
        # X_len = self.get_len(self.X)
        # y_len = self.get_len(self.y)
        X_status = self.client.submit(self.has_null, self.X)
        y_status = self.client.submit(self.has_null, self.y)
        X_len = self.client.submit(self.get_len, self.X)
        y_len = self.client.submit(self.get_len, self.y)
        # X_null, y_null, X_length, y_length
        X_null, y_null, X_length, y_length = self.client.gather(
            [X_status, y_status, X_len, y_len])

        assert X_null == False, 'X contains some columns with null/NaN values'
        assert y_null == False, 'y contains some columns with null/NaN values'
        assert X_length == y_length, 'Shape mismatch, X and y are of different length'
        return self

    def transform(self, X):
        return X

    @staticmethod
    # @delayed
    def has_null(df):
        return df.isnull().values.any()

    @staticmethod
    # @delayed
    def get_len(df):
        return len(df)

Here's the full stacktrace:

Sat Aug 11 13:29:08 2018
distributed.utils - ERROR - No module named 'diya_info'
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/anaconda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/anaconda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/anaconda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/anaconda3/lib/python3.6/site-packages/distributed/client.py", line 1315, in _gather
    traceback)
  File "/anaconda3/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/anaconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
ModuleNotFoundError: No module named 'diya_info'
Traceback (most recent call last):
  File "notebook/main.py", line 24, in <module>
    Instance = Instance.fit(X, y)
  File "/Users/asifali/workspace/pythonProjects/ML-engine-DataX/pre-processing/notebook/diya_info.py", line 28, in fit
    X_status, y_status, X_len, y_len)
  File "/anaconda3/lib/python3.6/site-packages/distributed/client.py", line 2170, in compute
    result = self.gather(futures)
  File "/anaconda3/lib/python3.6/site-packages/distributed/client.py", line 1437, in gather
    asynchronous=asynchronous)
  File "/anaconda3/lib/python3.6/site-packages/distributed/client.py", line 592, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/distributed/utils.py", line 254, in sync
    six.reraise(*error[0])
  File "/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/anaconda3/lib/python3.6/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/anaconda3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/anaconda3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/anaconda3/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/anaconda3/lib/python3.6/site-packages/distributed/client.py", line 1315, in _gather
    traceback)
  File "/anaconda3/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/anaconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
ModuleNotFoundError: No module named 'diya_info'

If I uncomment @delayed and a few of the other commented-out lines, it works. But how can I make it work when passing the client in as an argument? The idea is to use the same client for all the libraries I'm trying to write.

UPDATE 1: I fixed the second problem by removing the @staticmethod decorators and defining the functions inside fit as closures. But what's wrong with @staticmethod? These decorators are meant for functions that don't depend on self, right?

Here's the updated diya_info.py:

...
def fit(self, X, y):
   self.X = X
   self.y = y

   # function removed from @staticmethod
   def has_null(df): return df.isnull().values.any()
   # function removed from @staticmethod
   def get_len(df): return len(df)

   X_status = self.client.submit(has_null, self.X)
   y_status = self.client.submit(has_null, self.y)
...

Is there a way to do this with @staticmethod? I'm not happy with the way I solved this issue. I still have no clue about Problem 1.

Asif Ali

1 Answer

ModuleNotFoundError: No module named 'diya_info'

This means that while your client has access to this module, your workers do not. A simple way to resolve this would be to upload your script to your workers.

client.upload_file('diya_info.py')
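The upload has to happen before any task that refers to the module is submitted. A minimal sketch of that ordering follows; the helper submit_after_upload is hypothetical and not part of Dask, while upload_file, submit and Future.result are the Client and Future methods it relies on. The client is passed in, so the function itself does not import Dask.

```python
def submit_after_upload(client, module_path, func, *args):
    """Ship a local module to the workers, then run func on the cluster.

    `client` is expected to be a dask.distributed.Client. This helper is
    a hypothetical convenience wrapper, not a Dask API.
    """
    # Send the file to the workers so that pickled references to the
    # module can be resolved on their side.
    client.upload_file(module_path)
    future = client.submit(func, *args)
    # Block until the worker has produced a value and return it.
    return future.result()


# Intended use in main.py (needs the running scheduler from the question):
# from dask.distributed import Client
# from diya_info import Diya_Info
# client = Client('192.168.0.129:8786')
# n = submit_after_upload(client, 'diya_info.py', Diya_Info.get_len, [1, 2, 3])
```

In the question's code the same effect should be achievable by calling client.upload_file('diya_info.py') in main.py right after creating the client and before Instance.fit(X, y).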

But in general it's on you to ensure that your workers and clients all have the same software environment.
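As a rough illustration of why the workers need the module at all: a function that lives in an importable module is pickled by name, not by value, so the receiving process has to import that module itself. The sketch below uses only the standard pickle module and simulates a worker by removing the module from the import path. It assumes Dask's serializer treats such functions the same way, which is my understanding of cloudpickle. The module name demo_helpers is made up for the demo.

```python
import importlib
import os
import pickle
import shutil
import sys
import tempfile

# Write a throwaway module (hypothetical name) into a temporary directory.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "demo_helpers.py"), "w") as f:
    f.write("def get_len(obj):\n    return len(obj)\n")

# "Client" side: the module is importable, so pickling succeeds.
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
import demo_helpers

payload = pickle.dumps(demo_helpers.get_len)
# The payload names the module and the function; it carries no code.
stores_reference_only = b"demo_helpers" in payload and b"get_len" in payload

# "Worker" side: simulate a process that cannot import the module.
sys.path.remove(tmp_dir)
del sys.modules["demo_helpers"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    worker_error = None
except ModuleNotFoundError as exc:
    worker_error = str(exc)

shutil.rmtree(tmp_dir)

print(stores_reference_only)  # True
print(worker_error)           # No module named 'demo_helpers'
```

This would also explain the observation in UPDATE 1. Functions defined inside fit cannot be looked up by name, so they are most likely serialized by value and the worker never has to import diya_info. A @staticmethod, by contrast, is still addressed through its module, so it works once the workers can import that module.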

MRocklin