I've created a text classifier in Keras, and I can train the Keras model on Cloud ML just fine; the model is then deployed on Cloud ML. However, when I pass along text to classify, it returns the wrong classifications. I strongly suspect that it's not using the same tokenizer/word index that I used when creating the Keras classifier, and that was used to tokenise the new text.
I'm unsure how to pass along the tokeniser/word index to Cloud ML when training: there is a previous SO question, but will
gcloud ml-engine jobs submit training
pick up a pickle or text file containing the word index mapping? And if so, how should I configure the setup.py file?
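For context, my setup.py currently just lists the trainer package; I'm guessing the tokenizer pickle would need to be declared as package data, something like the sketch below (the file name and the package_data approach are my assumptions, not something I've confirmed works on ML Engine):
from setuptools import setup, find_packages

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    include_package_data=True,
    # hypothetical: bundle the saved tokenizer pickle with the trainer package
    package_data={'trainer': ['keras_tokenizer_embeddings_002.pickle']},
    install_requires=['keras', 'h5py'],
)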
EDIT:
So, I'm using Keras to tokenise input text like so:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
word_index = tokenizer.word_index
When I'm just running the Keras model locally, I save the model like so:
model.save('model_embeddings_20epochs_v2.h5')
I also save the tokenizer, so that I can use it to tokenize new data:
with open("../saved_models/keras_tokenizer_embeddings_002.pickle", "wb") as f:
pickle.dump(tokenizer, f)
On new data, I restore the model and tokenizer.
from keras.models import load_model

model = load_model('../../saved_models/model_embeddings_20epochs_v2.h5')
with open("../../saved_models/keras_tokenizer_embeddings_002.pickle", "rb") as f:
    tokenizer = pickle.load(f)
I then use the tokenizer to convert text to sequences on the new data, classify etc.
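Roughly like this (new_texts and MAX_SEQUENCE_LENGTH here are just placeholders standing in for my actual data and training constant):
from keras.preprocessing.sequence import pad_sequences

# new_texts is a list of raw strings to classify (placeholder)
new_sequences = tokenizer.texts_to_sequences(new_texts)
new_data = pad_sequences(new_sequences, maxlen=MAX_SEQUENCE_LENGTH)
predictions = model.predict(new_data)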
The script for the Cloud ML job does not save the tokenizer - I had presumed that the Keras script would end up using basically the same word index.
....
X_train = [x.encode('UTF8') for x in X_train]
X_test = [x.encode('UTF8') for x in X_test]
# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
.....
# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
At the moment, I'm just training it locally:
gcloud ml-engine local train \
--job-dir $JOB_DIR \
--module-name trainer.multiclass_glove_embeddings_v1 \
--package-path ./trainer \
-- \
--train-file ./data/corpus.pkl
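My working assumption is that the training script itself would need to pickle the tokenizer and write it out to the job dir on GCS, so I can fetch it later for tokenising new text. Something along these lines is what I have in mind (an untested sketch; using file_io to copy the file to GCS is my guess, and job_dir is just the value of --job-dir):
import pickle
from tensorflow.python.lib.io import file_io

# after tokenizer.fit_on_texts(X_train) in the training script:
with open('keras_tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)

# copy the pickle to the job dir on GCS so it can be retrieved later
# (job_dir is assumed to come from the --job-dir argument)
with file_io.FileIO('keras_tokenizer.pickle', mode='rb') as src:
    with file_io.FileIO(job_dir + '/keras_tokenizer.pickle', mode='wb') as dst:
        dst.write(src.read())
Is this the right approach, or is there a standard way to ship the word index alongside the model on Cloud ML?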