I am trying to run a machine learning prediction job in parallel on a huge pandas DataFrame. It seems like ray is a nice package for multiprocessing in Python. This is the code:
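import joblib
import numpy as np
import pandas as pd
import ray

model_path = './models/lr.pkl'
df = pd.read_csv('./data/input.csv')
dfs = np.array_split(df, 4)  # split the frame into 4 chunks
features = ['item_text', 'description', 'amount']

ray.init()

@ray.remote
def predict(model_path, df, features):
    # Each task loads the model itself and predicts on its chunk
    model = joblib.load(model_path)
    pred_df = model.predict(df[features])
    return pred_df

# Launch one remote task per chunk, then wait for all results
result_ids = []
for chunk in dfs:
    result_ids.append(predict.remote(model_path, chunk, features))
results = ray.get(result_ids)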
When I ran it, I got the following error:
PicklingError: args[0] from __newobj__ args has the wrong class
I take it args[0] refers to model_path. It is just a string, so why would it have the wrong class? What am I missing?
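
One way I could check that guess is to try pickling each argument on its own and see which one fails. Below is a minimal sketch using only the standard pickle module. The helper name find_unpicklable and the stand-in values are hypothetical, and ray may serialize arguments differently from plain pickle, so this is only an approximation of what happens inside predict.remote:

```python
import pickle


def find_unpicklable(**named_args):
    """Return {name: error text} for each argument that pickle cannot serialize."""
    failures = {}
    for name, value in named_args.items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle can raise several exception types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in values (hypothetical). In the real script these would be
# model_path, dfs[0] and features.
failures = find_unpicklable(
    model_path="./models/lr.pkl",
    features=["item_text", "description", "amount"],
    bad_arg=lambda x: x,  # deliberately unpicklable, to show what a failure looks like
)
print(failures)
```

If model_path and features come back clean when called with the real arguments, the string is probably not the culprit and the DataFrame chunk would be the next thing to test.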