Good morning!
I am currently working with a TfidfVectorizer from sklearn and a customized tokenizer. The idea is to pickle the fitted TfidfVectorizer and to load it into an AWS Lambda function, which then transforms text input.
The problem: on my local machine everything works fine. I can load the vectorizer from the S3 bucket, deserialize it, and use the resulting vectorizer object to transform text. On AWS it does not work: it seems the pickle cannot resolve my customized tokenizer, and I always get an AttributeError.
I already tried using a lambda function as the tokenizer together with the dill pickler, but that does not work on AWS either: it cannot find the PorterStemmer module used in my customized tokenizer.
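For reference, the dill variant looked roughly like this (a sketch: as far as I understand, dill pickles functions defined in __main__ by value, but any modules they import, here nltk, must still be installed in the Lambda deployment package):

import dill

# drop-in replacement for the pickle.dump call shown below
with open('tfidf_vect.pkl', 'wb') as f:
    dill.dump(tfidf, f)

with a matching dill.loads(model_str) on the Lambda side.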
The serialized TfidfVectorizer (I serialized it on my local machine):
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words=None,
                        tokenizer=tokenizer_porter)
tfidf.fit(X)  # X: the training corpus, defined elsewhere
pickle.dump(tfidf, open(pickle_path + 'tfidf_vect.pkl', 'wb'), protocol=4)
The deserialization (in the AWS Lambda function):
import pickle

import boto3
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

def load_model_from_bucket(key, bucket_name):
    s3 = boto3.resource('s3')
    complete_key = 'serialized_models/' + key
    res = s3.meta.client.get_object(Bucket=bucket_name, Key=complete_key)
    model_str = res['Body'].read()
    model = pickle.loads(model_str)
    return model

tfidf = load_model_from_bucket('tfidf_vect.pkl', bucket_name)
tfidf.transform(text_data)
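Note that I redefine tokenizer_porter in the handler, but as far as I can tell that alone does not help: pickle looks the function up on the __main__ module, which on Lambda is bootstrap.py rather than my handler file. A sketch of one workaround I have seen suggested (untested here) is to attach the function to __main__ before calling pickle.loads:

import __main__

# make the pickled reference '__main__.tokenizer_porter' resolvable on Lambda
__main__.tokenizer_porter = tokenizer_porter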
In AWS CloudWatch I get this traceback:
Can't get attribute 'tokenizer_porter' on <module '__main__' from '/var/runtime/awslambda/bootstrap.py'>: AttributeError
Traceback (most recent call last):
  File "/var/task/handler.py", line 56, in index
    tfidf = load_model_from_bucket('tfidf_vect.pkl', bucket_name)
  File "/var/task/handler.py", line 35, in load_model_from_bucket
    model = pickle.loads(model_str)
AttributeError: Can't get attribute 'tokenizer_porter' on <module '__main__' from '/var/runtime/awslambda/bootstrap.py'>
Do you have any ideas what I am doing wrong?
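For the same reason, the layout that is usually recommended is to define the tokenizer in its own importable module, so the pickle records a module-qualified name instead of __main__.tokenizer_porter. A sketch, assuming a hypothetical file tokenizers.py deployed both next to the training script and inside the Lambda package:

# tokenizers.py -- hypothetical shared module, importable in both environments
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

The training script and the Lambda handler would then both use from tokenizers import tokenizer_porter instead of defining the function inline.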
EDIT: I chose to do the tfidf vectorization within the AWS Lambda script itself, without the pickle serialization. This is computationally a bit more expensive, but it works without causing problems.
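A sketch of that workaround (corpus stands in for the training documents, which the function has to load itself, e.g. from S3):

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

def tokenizer_porter(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words=None,
                        tokenizer=tokenizer_porter)
tfidf.fit(corpus)  # refit inside the Lambda function instead of unpickling
features = tfidf.transform(text_data)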