I am trying to build a DNNClassifier that takes categorical inputs using TensorFlow to train a model on the Google Cloud Platform (GCP). I have a few categorical feature columns for which I use a vocabulary.txt file. For example:
tf.feature_column.categorical_column_with_vocabulary_file(
    key="feature_name",
    vocabulary_file=vocab_file,
    vocabulary_size=vocab_size
),
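For completeness, since DNNClassifier only accepts dense columns, I wrap the categorical column above in an indicator column before handing it to the estimator. This is just a sketch; the hidden units and model directory are placeholders, not my exact code:

import tensorflow as tf

cat_col = tf.feature_column.categorical_column_with_vocabulary_file(
    key="feature_name",
    vocabulary_file=vocab_file,
    vocabulary_size=vocab_size)

feature_columns = [
    # DNNClassifier needs dense input, so wrap the sparse categorical column.
    tf.feature_column.indicator_column(cat_col),
    # ...or: tf.feature_column.embedding_column(cat_col, dimension=8)
]

estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[64, 32],              # placeholder architecture
    model_dir='gs://my-bucket/model')   # placeholder path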
I spent several frustrating hours discovering that you can't use open() on GCP because it can't handle gs:// paths. Therefore, I used the following code to read in the vocabulary files:
from tensorflow.python.lib.io import file_io

def read_vocab_file(file_path):
    """Reads a vocab file into memory.

    Args:
        file_path: path to the vocab file in a Cloud Storage bucket.

    Returns:
        The vocab list and the size of the vocabulary.
    """
    with file_io.FileIO(file_path, 'r') as f:
        vocab_lines = f.readlines()
    vocab_size = len(vocab_lines)
    return vocab_lines, vocab_size
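In case it matters, this is roughly how I wire the helper into the feature column definition (the bucket path and feature name here are placeholders):

vocab_file = 'gs://my-bucket/vocab/feature_name.txt'   # placeholder path
vocab_lines, vocab_size = read_vocab_file(vocab_file)

feature_col = tf.feature_column.categorical_column_with_vocabulary_file(
    key="feature_name",
    vocabulary_file=vocab_file,
    vocabulary_size=vocab_size)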
This allows me to submit a training job in which I pass the paths to the vocabulary files as arguments:
gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $MODEL_DIR \
    --runtime-version 1.4 \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --vocab-paths $VOCAB \
    --latlon-data-paths $LATLON \
    --train-steps 1000 \
    --eval-steps 100
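For reference, these flags are consumed in trainer/task.py with argparse, roughly like this (the flag names mirror the command above; defaults are placeholders):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train-files', nargs='+', required=True)
parser.add_argument('--eval-files', nargs='+', required=True)
parser.add_argument('--vocab-paths', nargs='+', required=True)
parser.add_argument('--latlon-data-paths', nargs='+', required=True)
parser.add_argument('--train-steps', type=int, default=1000)
parser.add_argument('--eval-steps', type=int, default=100)
parser.add_argument('--job-dir', default='')
args = parser.parse_args()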
This works fine for training, but afterwards I am not able to make predictions. Is there a better way to train a model on Google Cloud ML Engine while using vocab.txt files to create categorical feature columns?
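For context, the export step I have been attempting for prediction looks roughly like this (the feature spec and export path are placeholders, not my exact code):

# Sketch of exporting the trained estimator for serving.
feature_spec = {
    'feature_name': tf.FixedLenFeature([], tf.string),  # placeholder feature
}

serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator.export_savedmodel(
    export_dir_base='gs://my-bucket/model/export',  # placeholder path
    serving_input_receiver_fn=serving_input_fn)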
Any example code that uses categorical features with a tf.estimator.DNNClassifier would be greatly appreciated, especially if it can run on GCP with hyperparameter optimization and make predictions in the cloud.
Thank you