
I'm fairly new to machine learning in general, and I want to store my model in the cloud in order to make online predictions.

I successfully trained a Logistic Regression model with a TfidfVectorizer (for sentiment analysis) in scikit-learn, both locally in a Jupyter Notebook and on Google AI Platform using their Training Job feature.

I should mention that I included bs4, nltk, and lxml as the required PyPI packages in my training package's setup.py file.

My training algorithm goes like this:

  1. Import a CSV file of input strings and their labels (outputs) as a pandas DataFrame (the model has one input variable, the string).

  2. Preprocess the input strings using bs4 and nltk to remove unnecessary characters and stopwords, and to lowercase all characters (to reproduce this, simply use lowercase, alphabet-only strings).

  3. Create a pipeline

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    tvec = TfidfVectorizer()
    lclf = LogisticRegression(fit_intercept=False, random_state=255, max_iter=1000)
    model_1 = Pipeline([('vect', tvec), ('clf', lclf)])
    
  4. Do cross-validation using GridSearchCV

    from sklearn.model_selection import GridSearchCV

    param_grid = [{'vect__ngram_range': [(1, 1)],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C': [1.0, 10.0, 100.0]},
                  {'vect__ngram_range': [(1, 1)],
                   'clf__penalty': ['l1', 'l2'],
                   'clf__C': [1.0, 10.0, 100.0],
                   'vect__use_idf': [False],
                   'vect__norm': [False]}]

    gs_lr_tfidf = GridSearchCV(model_1, param_grid, scoring='accuracy',
                               cv=5, verbose=1, n_jobs=-1)
    gs_lr_tfidf.fit(X_train, y_train)
    
  5. Get the best estimator from the search. This is the model saved in the Google model.joblib file.

    clf = gs_lr_tfidf.best_estimator_
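
For reference, here is a minimal, self-contained sketch of exporting a fitted pipeline the way AI Platform's scikit-learn runtime expects (the tiny training set is made up purely for illustration; AI Platform looks for a file named exactly model.joblib when joblib is used):

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in for the real training data
model = Pipeline([('vect', TfidfVectorizer()),
                  ('clf', LogisticRegression(max_iter=1000))])
model.fit(["good movie", "great film", "bad movie", "awful film"],
          ["good", "good", "bad", "bad"])

# The filename must be exactly "model.joblib" for AI Platform
joblib.dump(model, 'model.joblib')

# Sanity check: the restored model still predicts
restored = joblib.load('model.joblib')
print(restored.predict(["great movie"]))
```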
    

I can output a simple prediction in my Jupyter Notebook using

    predicted = clf.predict(["INPUT STRING"])
    print(predicted)

It prints the predicted label for my input string, such as ['good'] or ['bad'].

But while the model was successfully trained and deployed to AI Platform, when I try to request a prediction (in the required JSON format) such as:

["the quick brown fox jumps over the lazy dog"]
["hi what is up"]

The shell returns with this error:

{
  "error": "Prediction failed: Exception during sklearn prediction: 
  'numpy.ndarray' object has no attribute 'lower'"
}
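
For what it's worth, I can provoke a very similar error locally. This is only a sketch of my guess at what happens: if each instance reaches the pipeline as an array rather than a plain string, the TfidfVectorizer's default lowercasing step calls .lower() on an ndarray:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

model = Pipeline([('vect', TfidfVectorizer()),
                  ('clf', LogisticRegression(max_iter=1000))])
model.fit(["good movie", "bad movie"], ["good", "bad"])

# Works: each instance is a plain string
model.predict(["hi what is up"])

# Fails: each instance is an array, so .lower() is called on an ndarray
try:
    model.predict(np.array([["hi what is up"]]))
except AttributeError as e:
    print(e)  # 'numpy.ndarray' object has no attribute 'lower'
```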

What could have possibly gone wrong here?

Is this possibly a problem with the dependencies? Do I also need to install the bs4, lxml, and nltk packages for my Google-hosted model?

Or is my input JSON incorrectly formatted?

Thanks for your help.

1 Answer


Alright, I found out that the JSON format was indeed incorrect. (Answered at https://stackoverflow.com/a/51693619/10570541.)

As the official documentation states, the JSON format uses newlines (and square brackets) to separate instances, as in:

[6.8,  2.8,  4.8,  1.4]
[6.0,  3.4,  4.5,  1.6]

That applies if you have more than one input variable.

For one input variable only, simply separate instances with newlines:

"the quick brown fox jumps over the lazy dog"
"alright it works"