From the simple Dask delayed examples I have read, I expected that I could essentially replicate GridSearchCV from scikit-learn with a couple of function calls, as follows. Instead, it appears that the model is never fit (model.fit(...)), yet the rest of the loop (pred = model.predict(...)) still runs.

Is there an issue with how I am nesting the functions? I am aware that there is a GridSearchCV for Dask, but my real model is a multi-input Keras LSTM, and you can't pass a 3D array as 'X'. The code works fine in serial without Dask.

Here is a small reproducible example:

import dask
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold,ParameterGrid
from sklearn.metrics import mean_squared_error 
from keras import Sequential
from keras.layers import Dense

boston = load_boston()
y=boston.target
X=boston.data


@dask.delayed
def create_model(dense_nodes):
    model = Sequential()
    model.add(Dense(dense_nodes, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

@dask.delayed
def cv_model(X,y,kf,params_dct):

    dense_nodes = params_dct['dense']

    hold_actual=np.zeros((X.shape[0],1))
    hold_preds=np.zeros((X.shape[0],1))

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model=create_model(dense_nodes)
        model.fit(X_train,y_train,batch_size=64, epochs=5)
        pred=model.predict(X_test)
        hold_actual[test_index,0]=y_test.ravel()
        hold_preds[test_index,0]=pred.ravel()

    return(mean_squared_error(hold_actual,hold_preds))



kfold=KFold(n_splits=3,random_state=4521)
grid=ParameterGrid({'dense':[2,3,4,5,6,7,8,9,10]})

output=[]
for i in grid:
    output.append(cv_model(X,y,kfold,i))  # pass the current parameter set, not always grid[0]

total=dask.delayed(output)
total.compute()




---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-2116b76de18c> in <module>()
     52 
     53 total=dask.delayed(output)
---> 54 total.compute()

~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    153         dask.base.compute
    154         """
--> 155         (result,) = compute(self, traverse=False, **kwargs)
    156         return result
    157 

~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    402     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
    403                     else (None, a) for a in args]
--> 404     results = get(dsk, keys, **kwargs)
    405     results_iter = iter(results)
    406     return tuple(a if f is None else f(next(results_iter), *a)

~/anaconda3/lib/python3.6/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     73     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     74                         cache=cache, get_id=_thread_get_id,
---> 75                         pack_exception=pack_exception, **kwargs)
     76 
     77     # Cleanup pools associated to dead threads

~/anaconda3/lib/python3.6/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    519                         _execute_task(task, data)  # Re-execute locally
    520                     else:
--> 521                         raise_exception(exc, tb)
    522                 res, worker_id = loads(res_info)
    523                 state['cache'][key] = res

~/anaconda3/lib/python3.6/site-packages/dask/compatibility.py in reraise(exc, tb)
     65         if exc.__traceback__ is not tb:
     66             raise exc.with_traceback(tb)
---> 67         raise exc
     68 
     69 else:

~/anaconda3/lib/python3.6/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    288     try:
    289         task, data = loads(task_info)
--> 290         result = _execute_task(task, data)
    291         id = get_id()
    292         result = dumps((result, id))

~/anaconda3/lib/python3.6/site-packages/dask/local.py in _execute_task(arg, cache, dsk)
    269         func, args = arg[0], arg[1:]
    270         args2 = [_execute_task(a, cache) for a in args]
--> 271         return func(*args2)
    272     elif not ishashable(arg):
    273         return arg

<ipython-input-53-2116b76de18c> in cv_model(X, y, kf, params_dct)
     38         pred=model.predict(X_test)
     39         hold_actual[test_index,0]=y_test.ravel()
---> 40         hold_preds[test_index,0]=pred.ravel()
     41 
     42     return(mean_squared_error(hold_actual,hold_preds))

ValueError: setting an array element with a sequence.

ADD #1

Here is the second attempt, the error remains.

import dask
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold,ParameterGrid
from sklearn.metrics import mean_squared_error 
from keras import Sequential
from keras.layers import Dense
import tensorflow as tf
boston = load_boston()
y=boston.target
X=boston.data

import tensorflow as tf


# You never want to call delayed functions from within other delayed functions:
# https://stackoverflow.com/questions/51219354/cant-train-keras-model-with-dask

@dask.delayed
def create_model(dense_nodes):
    model = Sequential()
    model.add(Dense(dense_nodes, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model


def cv_model(X,y,kf,params_dct):

    dense_nodes = params_dct['dense']

    hold_actual=np.zeros((X.shape[0],1))
    hold_preds=np.zeros((X.shape[0],1))


    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]


        model=create_model(dense_nodes)
        model.fit(X_train,y_train,batch_size=64, epochs=5)

        pred=model.predict(X_test)

        hold_actual[test_index,0]=y_test.ravel()
        hold_preds[test_index,0]=pred.ravel()
    return(dask.delayed(mean_squared_error(hold_actual,hold_preds)))



kfold=KFold(n_splits=3,random_state=4521)
grid=ParameterGrid({'dense':[2,3,4,5,6,7,8,9,10]})

output=[]
for i in grid:
    delayed_value=cv_model(X,y,kfold,i)  # pass the current parameter set, not always grid[0]

result=delayed_value.compute()

ADD #2

It turns out that Keras / TF has an issue that causes an error even outside of Dask. I will address this in a separate question. So I swapped out the Keras model for an XGBoost one, to allow for a proper setup of Dask for this purpose.

Here is that code. I did find that I needed to comment out the call to Dask delayed in the mean_squared_error bit.

import dask
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold,ParameterGrid
from sklearn.metrics import mean_squared_error 

import xgboost as xgb





boston = load_boston()
y=boston.target
X=boston.data


@dask.delayed
def cv_model(X,y,kf,params_dct):


    hold_actual=np.zeros((X.shape[0],1))
    hold_preds=np.zeros((X.shape[0],1))

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]


        dtrain=xgb.DMatrix(data=X_train, label=y_train)
        dtest=xgb.DMatrix(data=X_test, label=y_test)

        regmod = xgb.train(params_dct, dtrain, 10)
        pred=regmod.predict(dtest)
        hold_actual[test_index,0]=y_test.ravel()
        hold_preds[test_index,0]=pred.ravel()

    #return(dask.delayed(mean_squared_error)(np.array(hold_actual),np.array(hold_preds)))
    return({'result':mean_squared_error(np.array(hold_actual),np.array(hold_preds)),'param':params_dct})


kfold=KFold(n_splits=3,random_state=4521)
grid=ParameterGrid({'max_depth':[2,3,4,5,6,7,8,9,10], 'eta':[0.01,0.05], 'min_child_weight': [1,2,3,4,5]})

output=[]
for i in grid:
    output.append(cv_model(X,y,kfold,i))

total=dask.delayed(output)
result=total.compute()
B_Miner

1 Answer

You don't want to call dask.delayed on the cv_model function. You never want to call delayed functions from within other delayed functions. Instead, functions that call delayed functions are often very fast (they don't do any work themselves, they only build the task graph), so you want to call them immediately rather than lazily.
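A minimal sketch of the nesting problem, using toy functions (`inc`, `nested`, and `build` are hypothetical names, not from the question). When a delayed function calls another delayed function, the inner call only produces a lazy `Delayed` object, and that object is what the outer task returns. This seems consistent with the ValueError in the question, where a lazy object ends up being assigned into a NumPy array.

```python
import dask
from dask.delayed import Delayed

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def nested(x):
    # inc is delayed, so this call does no arithmetic: it only creates
    # a Delayed object, which becomes the return value of this task.
    return inc(x)

def build(x):
    # Plain (non-delayed) function: it runs immediately and just hands
    # back the Delayed object created by inc.
    return inc(x)

nested_result = nested(1).compute()  # still a lazy object, not a number
built_result = build(1).compute()    # the actual value

print(type(nested_result).__name__)
print(built_result)
```

The only difference between `nested` and `build` is the decorator. Removing it from the outer function is the change suggested for cv_model.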

It looks like your for loop creates many models lazily, calls methods of those models (which will also be lazy), and then calls mean_squared_error on the results. This function will probably also have to be marked as delayed, like:

return dask.delayed(mean_squared_error)(hold_actual, hold_preds)

Then, if you remove the delayed decorator from cv_model you should be able to do something like:

delayed_value = cv_model(...)
result = delayed_value.compute()
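The overall shape might look like the sketch below. It is an illustration under stated assumptions, not the code from the question: a closed-form ridge regression stands in for the Keras model, synthetic data stands in for the Boston data set, and `kfold_indices`, `fit_predict`, and `score` are hypothetical helpers. The point is only the structure: the per-fold work and the scoring are delayed, while cv_model is a plain function that builds the graph.

```python
import dask
import numpy as np

def kfold_indices(n, k):
    # Hypothetical helper: yield (train, test) index arrays for k folds.
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

@dask.delayed
def fit_predict(X_train, y_train, X_test, ridge):
    # Stand-in model (assumption): ridge regression solved in closed form.
    A = X_train.T @ X_train + ridge * np.eye(X_train.shape[1])
    coef = np.linalg.solve(A, X_train.T @ y_train)
    return X_test @ coef

@dask.delayed
def score(y_true, fold_preds, fold_tests):
    # Reassemble out-of-fold predictions and return the mean squared error.
    preds = np.empty_like(y_true)
    for pred, test in zip(fold_preds, fold_tests):
        preds[test] = pred
    return float(np.mean((y_true - preds) ** 2))

def cv_model(X, y, k, ridge):
    # Not delayed: this only wires delayed calls together and returns fast.
    fold_preds, fold_tests = [], []
    for train, test in kfold_indices(len(y), k):
        fold_preds.append(fit_predict(X[train], y[train], X[test], ridge))
        fold_tests.append(test)
    return score(y, fold_preds, fold_tests)

# Synthetic regression data (assumption, replaces the Boston data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

grid = [0.0, 0.1, 1.0, 10.0]
tasks = [cv_model(X, y, 3, ridge) for ridge in grid]
scores = dask.compute(*tasks)  # one MSE per grid entry
print(dict(zip(grid, scores)))
```

Passing all tasks to a single `dask.compute` call lets the scheduler see every fold of every parameter set at once, rather than computing one grid entry at a time.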

In your second example you call model.fit without using the return value:

    model=create_model(dense_nodes)
    model.fit(X_train,y_train,batch_size=64, epochs=5)
    pred=model.predict(X_test)

Delayed doesn't operate in place, so calling model.fit alone won't do anything. You probably want:

model = model.fit(...)
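A small sketch of this behaviour with a toy class (`MeanModel` is hypothetical and assumes a scikit-learn style `fit` that returns `self`). A method call on a delayed object creates a new lazy task. If that task is not used, it never becomes part of the graph that is eventually computed.

```python
import dask

class MeanModel:
    # Toy stand-in for a real estimator (assumption, not from the question).
    def __init__(self):
        self.mean_ = None

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self  # scikit-learn style: fit returns the fitted model

    def predict(self):
        return self.mean_

model = dask.delayed(MeanModel)()

# The lazy fit task is created and then thrown away, so the graph
# behind `model` still describes an unfitted model.
model.fit([1.0, 2.0, 3.0])
unfitted_pred = model.predict().compute()

# Keeping the return value makes predict depend on the fit task.
fitted = model.fit([1.0, 2.0, 3.0])
fitted_pred = fitted.predict().compute()

print(unfitted_pred, fitted_pred)
```

One caveat: Keras `fit` returns a History object rather than the model, so with Keras a small helper function that fits and then returns the model (wrapped in dask.delayed) may be needed instead of rebinding `model` directly.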

Here you're calling dask.delayed on a result, rather than on the mean_squared_error function:

return(dask.delayed(mean_squared_error(hold_actual,hold_preds)))
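The difference between the two spellings can be seen with a toy function that records when it runs (`square` and `calls` are hypothetical names used only for this sketch):

```python
import dask

calls = []

def square(x):
    calls.append(x)  # record every real execution
    return x * x

# delayed(f(x)): square runs right here, eagerly; only its result is wrapped.
wrapped_result = dask.delayed(square(3))
calls_after_eager = list(calls)

# delayed(f)(x): nothing runs yet; the call is recorded as a lazy task.
lazy_call = dask.delayed(square)(4)
calls_after_lazy = list(calls)

lazy_value = lazy_call.compute()  # square(4) runs now
wrapped_value = wrapped_result.compute()

print(calls_after_eager, calls_after_lazy, calls)
```

In the first form all the work has already happened by the time Dask gets involved, so there is nothing left to parallelize.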

See https://github.com/dask/dask/pull/3737 for new docs

MRocklin
  • If I understood correctly, the error still remains. I posted the second attempt above. – B_Miner Jul 07 '18 at 16:17
  • See update. Also see new docs that this inspired: https://github.com/dask/dask/pull/3737 – MRocklin Jul 08 '18 at 15:44
  • Can you take a look above? I found there is an issue with Keras so I took that out for a straightforward xgboost model just to make sure I had the process correct. What do you think? I did find I needed to comment out the delayed call on mean_squared_error or just delayed objects were returned? – B_Miner Jul 13 '18 at 01:38
  • I fear that this is not correct as it is slower with Dask than without. – B_Miner Jul 14 '18 at 01:14