How to store a dictionary and map words to ints when using TensorFlow Serving?

Question

I have trained an LSTM RNN classification model in TensorFlow. Until now I have been saving and restoring checkpoints to retrain the model and to test it. Now I want to use TensorFlow Serving so that I can run the model in production.

Initially, I parse a corpus to build a dictionary, which is then used to map the words in a string to integers. I store this dictionary in a pickle file and reload it whenever I restore a checkpoint, whether to retrain on a data set or just to use the model, so that the mapping stays consistent. How do I store this dictionary when saving the model using SavedModelBuilder?
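For reference, the pickle-based workflow described above can be sketched in plain Python. The helper `RnnPreprocessing.map_vocab_to_int` is not shown in the question, so `build_vocab` and `encode` below are hypothetical stand-ins, and starting the ids at 1 is an assumption suggested by `n_words = len(vocab_to_int) + 1` in the code that follows.

```python
import pickle
from collections import Counter


def build_vocab(corpus):
    """Map each distinct word to an integer id, most frequent word first.

    Assumption: ids start at 1 so that 0 stays free for padding.
    """
    counts = Counter(corpus.split())
    # Sort by descending count, then alphabetically, so ids are reproducible.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}


def encode(text, vocab_to_int):
    """Convert a string to a list of ids, skipping words not in the vocabulary."""
    return [vocab_to_int[w] for w in text.split() if w in vocab_to_int]


vocab_to_int = build_vocab("the movie was good the plot was thin")

# Round-trip through pickle, as in the checkpoint-based workflow.
restored = pickle.loads(pickle.dumps(vocab_to_int, protocol=pickle.HIGHEST_PROTOCOL))
encoded = encode("the plot was good", restored)
```

The point of the round trip is that the restored dictionary yields exactly the same ids as the original, which is the property that has to survive the move to a SavedModel.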

My code for the neural network is as follows. The code for saving the model is towards the end (I am including an overview of the whole structure for context):

...


# Read files and store them in variables
with open('./someReview.txt', 'r') as f:
    reviews = f.read()
with open('./someLabels.txt', 'r') as f:
    labels = f.read()

...

#Pre-processing functions
#Parse through dataset and create a vocabulary
vocab_to_int, reviews = RnnPreprocessing.map_vocab_to_int(reviews)
with open(pickle_path, 'wb') as handle:
    pickle.dump(vocab_to_int, handle, protocol=pickle.HIGHEST_PROTOCOL)

#More preprocessing functions
...


# Building the graph
lstm_size = 256
lstm_layers = 2
batch_size = 1000
learning_rate = 0.01            
n_words = len(vocab_to_int) + 1 

# Create the graph object
tf.reset_default_graph()
with tf.name_scope('inputs'):
    inputs_ = tf.placeholder(tf.int32, [None, None], name="inputs")
    labels_ = tf.placeholder(tf.int32, [None, None], name="labels")
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")

#Create embedding layer LSTM cell, LSTM Layers

...

# Forward pass
with tf.name_scope("RNN_forward"):
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)


# Output. We are only interested in the latest output of the lstm cell
with tf.name_scope('predictions'):
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    tf.summary.histogram('predictions', predictions)
#More functions for cost, accuracy, optimizer initialization

... 

# Training
epochs = 1
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)

        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            summary, loss, state, _ = sess.run([merged, cost, final_state, optimizer], feed_dict=feed)

            train_writer.add_summary(summary, iteration)

            if iteration%1==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%2==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    summary, batch_acc, val_state = sess.run([merged, accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
            test_writer.add_summary(summary, iteration)



    #Saving the model
    export_path = './SavedModel'
    print ('Exporting trained model to %s'%(export_path))

    builder = saved_model_builder.SavedModelBuilder(export_path)

    # Build the signature_def_map.    
    classification_inputs = utils.build_tensor_info(inputs_)
    classification_outputs_classes = utils.build_tensor_info(labels_)

    classification_signature = signature_def_utils.build_signature_def(
        inputs={signature_constants.CLASSIFY_INPUTS: classification_inputs},
        outputs={
          signature_constants.CLASSIFY_OUTPUT_CLASSES:
              classification_outputs_classes,
        },
      method_name=signature_constants.CLASSIFY_METHOD_NAME)


    legacy_init_op = tf.group(
        tf.tables_initializer(), name='legacy_init_op')
    #add the sigs to the servable
    builder.add_meta_graph_and_variables(
        sess, [tag_constants.SERVING],
        signature_def_map={
            signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                classification_signature
        },
        legacy_init_op=legacy_init_op)
    print ("added meta graph and variables")

    #save it!
    builder.save()
    print("model saved")

I am not entirely sure this is the correct way to save a model like this, but it is the only implementation I have found in the documentation and online tutorials.

I haven't found any example or explicit guide in the documentation on saving the dictionary or on using it when restoring a SavedModel.

When using checkpoints, I would just load the pickle file before running the session. How do I restore this SavedModel so that I can use the same word-to-int mapping from the dictionary? Is there a specific way I should be saving or loading the model?

I have also used inputs_ as the input in the signature. This is a sequence of integers, produced after the words have been mapped. I can't specify a string as input because I get AttributeError: 'str' object has no attribute 'dtype'. In that case, how exactly are words mapped to integers in models that are in production?

score 0 · Answer 1 · answered Nov 20 '17 at 20:11

Implement your preprocessing using the utilities in tf.feature_column and it'll be straightforward to use the same mapping to integers in serving.

Alexandre Passos

Even then, unless I am missing something, you'd still need a lexicon or vocabulary to reference when mapping the words in an input to integers. Where would you store this dictionary? Can you please elaborate? – skbrhmn Nov 22 '17 at 15:43

score 0 · Answer 2 · answered May 14 '18 at 16:55

One approach to this is storing the vocabulary in the model's graph. This will then be shipped with the model.
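A lookup table built with `index_table_from_file` reads a plain-text vocabulary with one token per line and, by default, assigns each token its zero-based line number as id. A minimal sketch of converting an existing word-to-int dictionary into such a file follows. The helper names, the `<PAD>` token, and the file name `vocab.txt` are hypothetical, and the code assumes the ids are contiguous integers starting at 1, with line 0 reserved for padding.

```python
import os
import tempfile


def write_vocab_file(vocab_to_int, path, pad_token="<PAD>"):
    """Write one token per line so that the zero-based line number equals the id."""
    expected_ids = list(range(1, len(vocab_to_int) + 1))
    if sorted(vocab_to_int.values()) != expected_ids:
        raise ValueError("ids must be contiguous and start at 1")
    # Line 0 is the padding token; line i holds the word whose id is i.
    ordered = sorted(vocab_to_int, key=vocab_to_int.get)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join([pad_token] + ordered) + "\n")


def read_vocab_file(path):
    """Rebuild the word-to-id mapping from the file (line number = id)."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}


vocab_to_int = {"the": 1, "was": 2, "good": 3}
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
write_vocab_file(vocab_to_int, vocab_path)
round_trip = read_vocab_file(vocab_path)
```

Writing the padding token on line 0 keeps the ids learned during training unchanged, so the embedding weights still line up with the words they were trained on.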

...


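# Assumed import for TF 1.x (not shown in the snippet):
# from tensorflow.contrib import lookup

# In-graph table that maps each word to an id read from the vocabulary file
vocab_table = lookup.index_table_from_file(vocabulary_file='data/vocab.csv', num_oov_buckets=1, default_value=-1)
text = features[commons.FEATURE_COL]
# Split each input string on spaces (the result is a SparseTensor)
words = tf.string_split(text)
# Densify, filling shorter rows with the padding word
dense_words = tf.sparse_tensor_to_dense(words, default_value=commons.PAD_WORD)
word_ids = vocab_table.lookup(dense_words)

# Append MAX_DOCUMENT_LENGTH zeros to the right of every row...
padding = tf.constant([[0, 0], [0, commons.MAX_DOCUMENT_LENGTH]])
word_ids_padded = tf.pad(word_ids, padding)
# ...then keep only the first MAX_DOCUMENT_LENGTH ids of each row
word_id_vector = tf.slice(word_ids_padded, [0, 0], [-1, commons.MAX_DOCUMENT_LENGTH])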

Source: https://github.com/KishoreKarunakaran/CloudML-Serving/blob/master/text/imdb_cnn/model/cnn_model.py#L83
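To make the effect of the snippet above easier to follow, here is a plain-Python approximation of the same steps: split, fill with the padding word, look up ids with a single out-of-vocabulary bucket, then pad and slice to a fixed length. It is only an illustration, not TensorFlow code. `PAD_WORD`, `MAX_DOCUMENT_LENGTH`, and the tiny vocabulary are hypothetical stand-ins for the values in `commons` and the vocabulary file, and Python's `split()` is slightly more lenient about whitespace than `tf.string_split`.

```python
PAD_WORD = "<PAD>"          # hypothetical stand-in for commons.PAD_WORD
MAX_DOCUMENT_LENGTH = 5     # hypothetical stand-in for commons.MAX_DOCUMENT_LENGTH


def text_to_ids(texts, vocab):
    """Mirror the graph ops on a batch of strings using plain lists."""
    token_rows = [t.split() for t in texts]
    # Densify: every row gets the width of the longest row in the batch.
    width = max(len(row) for row in token_rows)
    dense = [row + [PAD_WORD] * (width - len(row)) for row in token_rows]
    # With one OOV bucket, every unknown word maps to the id len(vocab).
    oov_id = len(vocab)
    ids = [[vocab.get(word, oov_id) for word in row] for row in dense]
    # Like tf.pad: append MAX_DOCUMENT_LENGTH zeros to each row.
    padded = [row + [0] * MAX_DOCUMENT_LENGTH for row in ids]
    # Like tf.slice: keep only the first MAX_DOCUMENT_LENGTH ids.
    return [row[:MAX_DOCUMENT_LENGTH] for row in padded]


vocab = {"<PAD>": 0, "the": 1, "was": 2, "good": 3}
batch = text_to_ids(["the film was good", "good"], vocab)
long_doc = text_to_ids(["the the the the the the the"], vocab)
```

Padding by the full maximum length and then slicing guarantees a fixed output width whether the longest document in the batch is shorter or longer than the limit.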
