I'm fairly new to machine learning in general and want to store my model in the cloud in order to serve online predictions.
I successfully trained a Logistic Regression model with a TF-IDF vectorizer (for sentiment analysis) using scikit-learn, both locally in a Jupyter Notebook and on Google AI Platform using their training job feature.
I should mention that I included bs4, nltk, and lxml as the required PyPI packages in my training package's setup.py file.
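For reference, the dependency declaration in my setup.py looks roughly like this (the package name, version, and description are just placeholders, not my actual values):

```python
from setuptools import find_packages, setup

# Packages AI Platform installs before running the training job
REQUIRED_PACKAGES = ['bs4', 'nltk', 'lxml']

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Sentiment analysis training package',
)
```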
My training algorithm goes like this:
Import a CSV file of input strings and their labels (outputs) as a pandas DataFrame (the model has a single input variable: the string).
Preprocess the input strings using bs4 and nltk to remove unnecessary characters and stopwords, and to lowercase all characters (to reproduce this, simply use lowercase, alphabet-only strings).
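A minimal sketch of that preprocessing step (in the real code, bs4 strips the HTML and nltk supplies the stopword list; here a regex and a tiny hard-coded stopword set stand in for both):

```python
import re

# Tiny stand-in stopword set; the actual code uses nltk.corpus.stopwords
STOPWORDS = {'the', 'a', 'an', 'is', 'it', 'this'}

def preprocess(text):
    # Strip HTML tags (bs4 does this more robustly in the real code)
    text = re.sub(r'<[^>]+>', ' ', text)
    # Lowercase and keep alphabet-only tokens
    tokens = re.findall(r'[a-z]+', text.lower())
    # Drop stopwords
    return ' '.join(t for t in tokens if t not in STOPWORDS)

print(preprocess("<p>This IS an Example!</p>"))  # prints: example
```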
Create a pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tvec = TfidfVectorizer()
lclf = LogisticRegression(fit_intercept=False, random_state=255, max_iter=1000)
model_1 = Pipeline([('vect', tvec), ('clf', lclf)])
```
Do a cross-validation using GridSearchCV:

```python
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0],
     'vect__use_idf': [False],
     'vect__norm': [False]},
]

gs_lr_tfidf = GridSearchCV(model_1, param_grid, scoring='accuracy',
                           cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)
```
Get the best estimator. This is the model saved in the model.joblib file uploaded to Google:

```python
clf = gs_lr_tfidf.best_estimator_
```
I can output a simple prediction in my Jupyter Notebook using:

```python
predicted = clf.predict(["INPUT STRING"])
print(predicted)
```
It prints the predicted label for my input string, such as ['good'] or ['bad'].
But while the model was successfully trained and deployed to AI Platform, when I try to request a prediction such as (in the required JSON format):
```
["the quick brown fox jumps over the lazy dog"]
["hi what is up"]
```
The shell returns this error:

```
{
  "error": "Prediction failed: Exception during sklearn prediction: 'numpy.ndarray' object has no attribute 'lower'"
}
```
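For context, I can reproduce the exact same error locally when I pass the pipeline a 2-D array instead of a flat list of strings, which makes me wonder whether the JSON instances arrive as a nested array (a minimal sketch with a toy dataset standing in for my real CSV):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data standing in for my real CSV
texts = ["good movie great acting", "bad plot terrible pacing",
         "great fun good film", "terrible bad boring"]
labels = ["good", "bad", "good", "bad"]

model = Pipeline([("vect", TfidfVectorizer()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(texts, labels)

# A flat list of strings works: predict returns a one-element label array
print(model.predict(["good great movie"]))

# A 2-D array (each string wrapped in its own row) raises the same error,
# because TfidfVectorizer calls .lower() on each "document" it iterates over
try:
    model.predict(np.array([["good great movie"]]))
except AttributeError as e:
    print(e)  # prints: 'numpy.ndarray' object has no attribute 'lower'
```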
What could have possibly gone wrong here?
Is this possibly a dependency problem, meaning I also need to install the bs4, lxml, and nltk packages for my Google-hosted model?
Or is my input JSON incorrectly formatted?
Thanks for your help.