I can successfully train the census sample (https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census) both locally and in the cloud, and I'm able to deploy the model and run predictions in the cloud.

But if I wanted to run my predictions locally - not in the cloud - how would I go about that?

I'm a novice, but I have tried a few naive approaches, all of which failed; see below for three specific ones.

Any hints or references to snippets are welcome.

:-)

M.

** update regarding approach #1 in the original post **

If I include the single line:

c = tf.contrib.learn.DNNLinearCombinedClassifier(model_dir=job_dir)

I get an error, see error #a below.

If I naively edit the call to include the missing parameters, the constructor works, but a subsequent call to predict fails with error #b, see below. To do this, I made wide_columns and deep_columns in model.py global and modified the above line to be

c = tf.contrib.learn.DNNLinearCombinedClassifier(model_dir=job_dir, linear_feature_columns=model.wide_columns, dnn_feature_columns=model.deep_columns)

My PyCharm debugger confirms that model.wide_columns and model.deep_columns are instantiated and not empty at the time of the call.

This, however, leads to an "empty" classifier: I do not believe the DNNLinearCombinedClassifier picks up any model content from my job_dir. I would have included screenshots from inspecting the classifier while it is instantiated in model.py build_estimator() (I assigned it to a variable c there as well and set a breakpoint) and from the above c in task.py, but I'm not allowed to post them due to my lack of reputation. The difference is obvious, though: e.g. c->params->dnn_hidden_units is empty for the restored classifier, but set ([100,70,48,34]) for the original classifier.

I include an ls -R for the job_dir (called output), see #c below.

And I do rm -rf output for each run so the job_dir is clean.

Clearly I err somewhere, but in my lack of insight I'm unable to see where. Any further advice is appreciated.

:-)

M.

----------------------- console output (update) --------------------------

a.

Starting Census: Please lauch tensorboard to see results:
tensorboard --logdir=$MODEL_DIR
2017-05-30 12:14:10.570030: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-30 12:14:10.570042: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-30 12:14:10.570046: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
  File "<..>/trainer/task.py", line 199, in <module>
    c = tf.contrib.learn.DNNLinearCombinedClassifier(model_dir=job_dir)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 335, in new_func
    return func(*args, **kwargs)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 597, in __init__
    raise ValueError("Either linear_feature_columns or dnn_feature_columns "
ValueError: Either linear_feature_columns or dnn_feature_columns must be defined.

Process finished with exit code 1

b.

Starting Census: Please lauch tensorboard to see results:
tensorboard --logdir=$MODEL_DIR
2017-05-30 12:31:47.967638: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-30 12:31:47.967650: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-30 12:31:47.967653: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
Traceback (most recent call last):
  File "<..>/repository/git/13cx/subject-matter/google-cloud/1705cloudml/170530local-save/trainer/task.py", line 206, in <module>
    p = c.predict(input_fn=eval2_input_fn)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 335, in new_func
    return func(*args, **kwargs)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 335, in new_func
    return func(*args, **kwargs)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 660, in predict
    as_iterable=as_iterable)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 335, in new_func
    return func(*args, **kwargs)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 695, in predict_classes
    as_iterable=as_iterable)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 281, in new_func
    return func(*args, **kwargs)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 565, in predict
    as_iterable=as_iterable)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 857, in _infer_model
    infer_ops = self._get_predict_ops(features)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1188, in _get_predict_ops
    return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.INFER)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1103, in _call_model_fn
    model_fn_results = self._model_fn(features, labels, **kwargs)
  File "<..>/.local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py", line 201, in _dnn_linear_combined_model_fn
    "dnn_hidden_units must be defined when dnn_feature_columns is "
ValueError: dnn_hidden_units must be defined when dnn_feature_columns is specified.

Process finished with exit code 1

c.

$ ls -R output/
output/:
checkpoint                                     graph.pbtxt                          model.ckpt-2.data-00000-of-00001
eval                                           model.ckpt-1000.data-00000-of-00001  model.ckpt-2.index
events.out.tfevents.1496140978.yarc-mainlinux  model.ckpt-1000.index                model.ckpt-2.meta
export                                         model.ckpt-1000.meta

output/eval:
events.out.tfevents.1496140982.yarc-mainlinux  events.out.tfevents.1496140987.yarc-mainlinux

output/export:
Servo

output/export/Servo:
1496140989

output/export/Servo/1496140989:
saved_model.pb  variables

output/export/Servo/1496140989/variables:
variables.data-00000-of-00001  variables.index

----------** original post **----------

-------- things I've tried ------------

See at the bottom for code with references to 1, 2, 3..

  1. Reinstantiate the DNNLinearCombinedClassifier with a model_dir parameter pointing to where the model is stored. The plan was to run the classifier's predict method. I'm not able to make the classifier reflect the saved model.

  2. Restore the model through saver.restore(). This works, but I don't understand how to proceed from there, probably due to my lack of TensorFlow insight.

  3. Produce some test data for use with method 1. Evaluation of the tensors never exits. How do I evaluate an input batch, so that I can see it as a matrix?

--------- accompanying code -----------------

(this code is simply appended to the end of trainer/task.py)

  # last original line from task.py:
  learn_runner.run(generate_experiment_fn(**arguments), job_dir)

  # my stuff: 

  # 1. restore the classifier from model dir, fails
  # c = tf.contrib.learn.DNNLinearCombinedClassifier(model_dir=job_dir)

  # 2. restore model, works ok, but then how?
  sess = tf.Session()
  saver = tf.train.import_meta_graph('output/model.ckpt-1000.meta')
  saver.restore(sess, tf.train.latest_checkpoint('./output/'))
  sess.run(tf.global_variables_initializer())
  print("Sanity check, a variable instance {}".format(
      sess.run('dnn/input_from_feature_columns/education_embedding/weights/part_0:0')))
  sess.close()

  # 3. produce some test input (we're for simplicity reusing the eval set), apparently works, but an evaluation hangs forever
  eval2_input_fn = model.generate_input_fn(
      arguments['eval_files'],
      batch_size=arguments['eval_batch_size'],
      shuffle=False
  )

  # 3a. inspecting some input, the evaluation never ends.
  input = eval2_input_fn()
  print("input: {}".format(input))
  with tf.Session() as sess:
      evalinput = input[1].eval()
      print("evalinput: {}".format(evalinput))
  print("\nDone")
Neil Lunn
yarc68000
  • Can you provide the information about the error message using approach #1? In addition, can you provide a recursive directory listing of job_dir? – rhaertel80 May 24 '17 at 06:39

3 Answers

The simplest way is to use gcloud:

gcloud ml-engine local predict --model-dir output/export/Servo/1496140989 \
  --json-instances ../test.json
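
For reference, `--json-instances` expects a file with one JSON object per line. The following is a minimal sketch of producing such a file from Python. The helper name `write_json_instances` and the output path `test.json` are assumptions for illustration, the instance reuses the census feature values shown elsewhere on this page, and it presumes the model was exported with a JSON serving input.

```python
import json


def write_json_instances(path, instances):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w") as f:
        for instance in instances:
            f.write(json.dumps(instance, sort_keys=True) + "\n")


# Illustrative instance. The leading spaces in the string values mirror
# how the training CSV is parsed (see the note in the answer below).
example = {
    "age": 42,
    "workclass": " Private",
    "education": " Masters",
    "education_num": 14,
    "marital_status": " Never-married",
    "occupation": " Adm-clerical",
    "relationship": " Not-in-family",
    "race": " White",
    "gender": " Male",
    "capital_gain": 0,
    "capital_loss": 0,
    "hours_per_week": 42,
    "native_country": " United-States",
}

# Hypothetical output path; point --json-instances at wherever you write it.
write_json_instances("test.json", [example])
```

Each additional instance becomes one more line in the file.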
James Hirschorn

If performance is not a concern, you could just use the predict function directly (#1 above):

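# Pass the same feature columns and hidden units that were used for training;
# the constructor raises a ValueError without them. model.wide_columns and
# model.deep_columns assume those lists are exposed at module level in
# model.py, as described in the question's update.
c = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=job_dir,
    linear_feature_columns=model.wide_columns,
    dnn_feature_columns=model.deep_columns,
    dnn_hidden_units=[100, 70, 48, 34])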
eval2_input_fn = model.generate_input_fn(
      arguments['eval_files'],
      batch_size=arguments['eval_batch_size'],
      shuffle=False
)
c.predict(input_fn=eval2_input_fn)

Or you can do things a little more manually:

import collections
import os

import tensorflow as tf


class Predictor(object):

  def __init__(self, export_dir):
    self._sess = tf.Session()
    # Load the SavedModel
    meta = tf.saved_model.loader.load(self._sess, ['serve'], export_dir)
    # Map input aliases to the actual tensor names in the graph.
    inputs = meta.signature_def['serving_default'].inputs
    self._input_dict = {alias: info.name for alias, info in inputs.items()}
    # Get the output aliases and tensor names
    outputs = meta.signature_def['serving_default'].outputs
    output_dict = [(alias, info.name) for alias, info in outputs.items()]
    self._out_aliases, self._fetches = zip(*output_dict)

  def predict(self, examples):
    """Perform prediction on a list of examples (dicts)"""
    # Convert the list of examples to a feed dict by converting the rows to columns
    # and changing the tensor aliases to actual tensor names.
    columns = self._columnarize(examples)
    feed_dict = {self._input_dict[name]: val for name, val in columns.items()}
    # Perform the actual prediction.
    fetched = self._sess.run(self._fetches, feed_dict)
    # Convert the fetched data to friendlier row-based output whose keys are
    # the output names/aliases.
    output_dict = dict(zip(self._out_aliases, fetched))
    return self._rowify(output_dict)

  def _columnarize(self, examples):
    """Convert a list of dicts to a dict of lists."""
    columns = collections.defaultdict(list)
    for example in examples:
      for name, val in example.items():
        columns[name].append(val)
    return columns

  def _rowify(self, output_dict):
    """Convert a dict of lists to a list of dicts."""
    rows = []
    row_values = zip(*output_dict.values())
    for row in row_values:
      # Convert the row data to a dict
      rows.append(dict(zip(output_dict.keys(), row)))
    return rows

# Be sure to set the last path element to the correct value.
export_dir = os.path.join(job_dir, 'export', 'Servo', '1496140989')
p = Predictor(export_dir)  

# Create an example. Note the space before strings due to the way
# the CSV file is parsed during training.
example = {'age': 42,
           'workclass': ' Private',
           'education': ' Masters',
           'education_num': 14,
           'marital_status': ' Never-married',
           'occupation': ' Adm-clerical',
           'relationship': ' Not-in-family',
           'race': ' White',
           'gender': ' Male',
           'capital_gain': 0,
           'capital_loss': 0,
           'hours_per_week': 42,
           'native_country': ' United-States'}
p.predict([example])

[{u'probabilities': array([ 0.90454769, 0.09545235], dtype=float32), u'logits': array([-2.24880791], dtype=float32), u'classes': 0, u'logistic': array([ 0.09545235], dtype=float32)}]
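
The row/column conversion done by `_columnarize` and `_rowify` can be tried on its own, without TensorFlow. The following is a minimal standalone sketch of the same idea; the free functions `columnarize` and `rowify` are illustrative names, not part of any library, and the two input rows are made up from values used on this page.

```python
import collections


def columnarize(examples):
    """Convert a list of dicts (rows) to a dict of lists (columns)."""
    columns = collections.defaultdict(list)
    for example in examples:
        for name, val in example.items():
            columns[name].append(val)
    return dict(columns)


def rowify(output_dict):
    """Convert a dict of lists (columns) back to a list of dicts (rows)."""
    keys = list(output_dict.keys())
    row_values = zip(*(output_dict[key] for key in keys))
    return [dict(zip(keys, row)) for row in row_values]


examples = [
    {'age': 42, 'workclass': ' Private'},
    {'age': 47, 'workclass': ' Private'},
]

columns = columnarize(examples)
print(columns)                      # one list per feature name
print(rowify(columns) == examples)  # True: the round trip restores the rows
```

In the `Predictor` class the column dict is what gets turned into the feed dict, and the fetched outputs go through the reverse conversion.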

The hang is probably because you need to start "queue runners".

with tf.Session() as sess:
  # Start populating the filename queue.
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  print(sess.run(...))

  coord.request_stop()
  coord.join(threads)

That said, it's a little tricky to print out the inputs when using queues.
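
To see why the evaluation blocks without queue runners, here is a rough analogy in plain Python 3 (not TensorFlow code): reading from an input queue waits until some other thread puts data into it, which is the job queue runners do for the input pipeline. `queue.Queue` and the `producer` function below are stand-ins chosen for illustration.

```python
import queue
import threading

q = queue.Queue(maxsize=2)


def producer():
    """Stand-in for a queue runner: fills the queue with batches."""
    for i in range(3):
        q.put("batch-{}".format(i))


# No producer is running yet, so a blocking get() would wait forever.
# A short timeout makes that visible instead of hanging.
try:
    q.get(timeout=0.1)
    got_data_without_producer = True
except queue.Empty:
    got_data_without_producer = False

# Once the producer thread is started, reads succeed.
t = threading.Thread(target=producer)
t.start()
batches = [q.get(timeout=1.0) for _ in range(3)]
t.join()

print(got_data_without_producer)  # False
print(batches)                    # ['batch-0', 'batch-1', 'batch-2']
```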

rhaertel80
  • Thanks a lot for a response. :-) I've included an update to address the questions you've asked (the job_dir and details on why #1 doesn't work) – yarc68000 May 30 '17 at 11:08

You can use the Estimator itself to do the predict (though this is not fast enough for production usage).

Two things to be careful about:

  • Make sure the model_dir contains the checkpoint saved by the training process. Predict loads the parameters from that checkpoint; without it there are no trained weights to predict with.

  • You need to construct the Estimator with the same settings as in training.

The easiest way to do this (given the example provided by the cloudml-samples) is to

  1. construct the Experiment using the same settings as your training process
  2. take the estimator from the Experiment (this ensures that the estimator is constructed in the same way as in training)
  3. prepare the input_fn for prediction and call predict

When using the Estimator itself, you need to run it with local Python, as it cannot take advantage of Google Cloud.

In the following example, I commented out learn_runner.run to disable the training (assuming you have already trained your model and saved the checkpoint into the job_dir), then used numpy_input_fn to prepare the data for predict.

  ## Commented out the learn_runner run to do predict.
  ## Now the code can only work with local python.
  # learn_runner.run(generate_experiment_fn(**arguments), job_dir)

  # Change the code to construct the Estimator with exactly the same settings as
  # distributed training (with Experiment) but take the Estimator out and call
  # predict explicitly.
  experiment_fn = generate_experiment_fn(**arguments)
  experiment = experiment_fn(job_dir)
  print("Using estimator to predict")
  estimator = experiment.estimator

  # The data contains two items.    
  data = {
      'age': [42, 47],
      'workclass': ['Private', 'Private'],
      'education': ['Masters', 'Prof-school'],
      'education_num': [14, 15],
      'marital_status': ['Never-married', 'Married-civ-spouse'],
      'occupation': ['Adm-clerical', 'Prof-specialty'],
      'relationship': ['Not-in-family', 'Wife'],
      'race': ['White', 'White'],
      'gender': ['Male', 'Female'],
      'capital_gain': [0, 0],
      'capital_loss': [0, 1902],
      'hours_per_week': [42, 60],
      'native_country': ['United-States', 'Honduras'],
  }

  import numpy as np

  for k, v in data.items():
    # Convert each column to a numpy array and make sure it has rank 2, which is
    # required by the DNNLinearCombinedClassifier.
    data[k] = np.expand_dims(np.array(v), -1)

  predict_input_fn = tf.contrib.learn.io.numpy_input_fn(
      x=data, shuffle=False, num_epochs=1)

  for predicted_item in estimator.predict(input_fn=predict_input_fn):
    print('Prediction: {}'.format(predicted_item))
J. Xie