
I am trying to run a machine learning prediction job in parallel on a huge pandas dataframe. Ray seems like a nice package for multiprocessing in Python. Here is my code:

import joblib
import numpy as np
import pandas as pd
import ray

model_path = './models/lr.pkl'
df = pd.read_csv('./data/input.csv')
dfs = np.array_split(df, 4)

features = ['item_text', 'description', 'amount']
ray.init()

@ray.remote
def predict(model_path, df, features):
    model = joblib.load(model_path)
    pred_df = model.predict(df[features])
    return pred_df

result_ids = []
for i in range(4):
    result_ids.append(predict.remote(model_path, dfs[i], features))

results = ray.get(result_ids)

When I ran it, I got the following error:

PicklingError: args[0] from __newobj__ args has the wrong class 

I take it args[0] refers to model_path. It is just a string, so why the wrong class? What am I missing?

  • Maybe [this SO post](https://stackoverflow.com/questions/44911539/pickle-picklingerror-args0-from-newobj-args-has-the-wrong-class-with-hado) helps. If you import the class which you want to load in the def, it might work. – above_c_level Jun 27 '20 at 18:18
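To illustrate the comment's suggestion: importing the class of the pickled object inside the remote function, so that it is available in the worker process before unpickling, can resolve this kind of PicklingError. A minimal sketch, assuming the model at ./models/lr.pkl is a scikit-learn LogisticRegression (the actual class is not stated in the question):

@ray.remote
def predict(model_path, df, features):
    # Assumed model class: import it inside the worker before unpickling,
    # so the class is registered in the worker's namespace.
    from sklearn.linear_model import LogisticRegression  # noqa: F401
    model = joblib.load(model_path)
    return model.predict(df[features])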

1 Answer


In my case, the remote call failed whenever I passed it more than two arguments. After I bundled the two static arguments into a tuple, it worked.

@ray.remote
def predict(df, args):
    # Unpack the two static arguments that were bundled into a tuple
    model_path, features = args
    model = joblib.load(model_path)
    pred_df = model.predict(df[features])
    return pred_df

args = (model_path, features)
result_ids = []
for i in range(4):
    result_ids.append(predict.remote(dfs[i], args))
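For completeness, the per-chunk predictions still need to be gathered on the driver. A short sketch, assuming each worker returns a NumPy array (np.array_split preserves row order, so concatenating lines the results back up with the original dataframe):

# Block until all four tasks finish, then stitch the chunk
# predictions back together in the original row order.
results = ray.get(result_ids)
df['prediction'] = np.concatenate(results)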