Good morning!
I am currently working with a TfidfVectorizer from sklearn and a customized tokenizer. The idea is to pickle the fitted TfidfVectorizer and to load it into an AWS Lambda function, which then transforms text input.
The problem: on my local machine everything works fine. I can load the vectorizer from the S3 bucket, deserialize it, and use the resulting vectorizer object to transform text. On AWS it does not work: it seems the pickle cannot resolve my customized tokenizer, and I always get an AttributeError.
I already tried using a lambda function as the tokenizer together with the dill pickler, but that does not work on AWS either: it cannot find the PorterStemmer module used in my customized tokenizer.
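For reference, the dill variant looked roughly like this (a sketch: as far as I understand, dill pickles functions defined in __main__ by value, but any modules they import, here nltk, must still be installed in the Lambda deployment package):

import dill

# drop-in replacement for the pickle.dump call shown below
with open('tfidf_vect.pkl', 'wb') as f:
    dill.dump(tfidf, f)

with a matching dill.loads(model_str) on the Lambda side.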
The serialized TfidfVectorizer (I serialized it on my local machine):
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words=None,
                        tokenizer=tokenizer_porter)
tfidf.fit(X)  # X: the training corpus, defined elsewhere
pickle.dump(tfidf, open(pickle_path + 'tfidf_vect.pkl', 'wb'), protocol=4)
The deserialization (in the AWS Lambda function):
import pickle

import boto3
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

def load_model_from_bucket(key, bucket_name):
    s3 = boto3.resource('s3')
    complete_key = 'serialized_models/' + key
    res = s3.meta.client.get_object(Bucket=bucket_name, Key=complete_key)
    model_str = res['Body'].read()
    model = pickle.loads(model_str)
    return model

tfidf = load_model_from_bucket('tfidf_vect.pkl', bucket_name)
tfidf.transform(text_data)
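Note that I redefine tokenizer_porter in the handler, but as far as I can tell that alone does not help: pickle looks the function up on the __main__ module, which on Lambda is bootstrap.py rather than my handler file. A sketch of one workaround I have seen suggested (untested here) is to attach the function to __main__ before calling pickle.loads:

import __main__

# make the pickled reference '__main__.tokenizer_porter' resolvable on Lambda
__main__.tokenizer_porter = tokenizer_porter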
In AWS CloudWatch I get this traceback:
Can't get attribute 'tokenizer_porter' on <module '__main__' from '/var/runtime/awslambda/bootstrap.py'>: AttributeError
Traceback (most recent call last):
  File "/var/task/handler.py", line 56, in index
    tfidf = load_model_from_bucket('tfidf_vect.pkl', bucket_name)
  File "/var/task/handler.py", line 35, in load_model_from_bucket
    model = pickle.loads(model_str)
AttributeError: Can't get attribute 'tokenizer_porter' on <module '__main__' from '/var/runtime/awslambda/bootstrap.py'>
Do you have any ideas what I am doing wrong?
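For the same reason, the layout that is usually recommended is to define the tokenizer in its own importable module, so the pickle records a module-qualified name instead of __main__.tokenizer_porter. A sketch, assuming a hypothetical file tokenizers.py deployed both next to the training script and inside the Lambda package:

# tokenizers.py -- hypothetical shared module, importable in both environments
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

The training script and the Lambda handler would then both use from tokenizers import tokenizer_porter instead of defining the function inline.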
EDIT: I chose to do the tfidf vectorization within the AWS Lambda script itself, without the pickle serialization. This is computationally a bit more expensive, but it works without causing problems.
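A sketch of that workaround (corpus stands in for the training documents, which the function has to load itself, e.g. from S3):

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words=None,
                        tokenizer=tokenizer_porter)
tfidf.fit(corpus)  # refit inside the Lambda function instead of unpickling
features = tfidf.transform(text_data)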