
I switched from tf.train.Saver to the SavedModel format, and surprisingly this makes loading my model from disk a lot slower (it takes minutes instead of a couple of seconds). Why is this, and what can I do to load the model faster?

I used to do this:

# Save model
saver = tf.train.Saver()
save_path = saver.save(session, model_path)

# Load model
saver = tf.train.import_meta_graph(model_path + '.meta')
saver.restore(session, model_path)

But now I do this:

# Save model
builder = tf.saved_model.builder.SavedModelBuilder(model_path)
builder.add_meta_graph_and_variables(session, [tf.saved_model.tag_constants.TRAINING])
builder.save()

# Load model
tf.saved_model.loader.load(session, [tf.saved_model.tag_constants.TRAINING], model_path)
bR3nD4n
Carl Thomé
  • Could you put together a graph/model that illustrates the issue? – Allen Lavoie May 30 '17 at 16:34
  • I tried reproducing the slowdown with ResNet50, VGG16 and Inception-v4 but they all loaded about as fast with both tf.train.Saver and SavedModel. Are you using any RNNs perhaps? Or some control flow like tf.while_loop? – Carl Thomé Sep 20 '17 at 08:57
  • What order of magnitude are we talking about? My model is just a fine-tuning of Inception-v3; how long did it take to load your version? Multiple minutes? It is possible this is the correct amount of time, I'm trying to gauge where the problem is. – bw4sz Sep 21 '17 at 18:15
  • Compare the sizes of the resulting saved files. – igrinis Sep 25 '17 at 05:33
  • Does this occur just during training? Or does this happen at serving too? Also, I'm looking at the underlying methods and the protos that they are constructing. The old method returned a much smaller proto than the new one. See [THIS](https://github.com/tensorflow/tensorflow/blob/r1.3/tensorflow/core/protobuf/saved_model.proto) and [THIS](https://github.com/tensorflow/tensorflow/blob/r1.3/tensorflow/core/protobuf/meta_graph.proto). Looks like the new version allows repeated whole "models" in the same proto. – bR3nD4n Sep 28 '17 at 01:56
  • Are you able to profile the CPU or IO during the operation? Simply, does htop show a CPU / core pegged for the whole time? Does iotop show a constant operation? My guess is that there is something unoptimised happening during the slow operation that is causing a lot of unnecessary CPU work. This could be a poor algorithm for extracting the information that is needed, or something weird in the marshalling to output. First step though, check the server performance. – Phil Nov 06 '17 at 14:32

2 Answers


I am by no means an expert in TensorFlow, but if I had to guess why this is happening, I would say that:

  • tf.train.Saver() saves a complete meta-graph. Therefore, all the information needed to perform the operations contained in your graph is already there. All TensorFlow needs to do to load the model is insert the meta-graph into the default/current graph, and you're good to go.
  • SavedModelBuilder(), on the other hand, creates a language-agnostic representation of your operations and variables behind the scenes. This means that the loading method has to extract all of that information, recreate the operations and variables from your previous graph, and insert them into the default/current graph.

Depending on the size of your graph, recreating everything that it contained might take some time.
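
If you want to sanity-check this on your own model (and follow up on the comment above about comparing file sizes), you can parse what each format actually writes to disk. Below is a minimal sketch assuming the TF 1.x API; the file paths are placeholders, not taken from the question:

import os
from tensorflow.core.protobuf import meta_graph_pb2, saved_model_pb2

# Hypothetical paths -- point these at your own saved files.
saver_meta_path = 'model.ckpt.meta'                             # written by tf.train.Saver
saved_model_pb = os.path.join('export_dir', 'saved_model.pb')   # written by SavedModelBuilder

# tf.train.Saver writes a single MetaGraphDef to the .meta file.
meta_graph = meta_graph_pb2.MetaGraphDef()
with open(saver_meta_path, 'rb') as f:
    meta_graph.ParseFromString(f.read())

# SavedModel wraps one or more MetaGraphDefs inside a SavedModel proto.
saved_model = saved_model_pb2.SavedModel()
with open(saved_model_pb, 'rb') as f:
    saved_model.ParseFromString(f.read())

print('Saver .meta:     %d bytes, %d graph nodes'
      % (os.path.getsize(saver_meta_path), len(meta_graph.graph_def.node)))
print('saved_model.pb:  %d bytes, %d meta-graph(s)'
      % (os.path.getsize(saved_model_pb), len(saved_model.meta_graphs)))
for i, mg in enumerate(saved_model.meta_graphs):
    print('  meta-graph %d: %d graph nodes' % (i, len(mg.graph_def.node)))

Comparing sizes and node counts gives a first hint of whether the SavedModel is simply storing more, as the comments suggest.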

Concerning the second question, as @J_H said, if there is no reason for you to use one strategy over the other and time is of the essence, just go with the faster one.

domochevski

what can I do to load the model faster?

Switch back to tf.train.Saver: your question shows no motivation for using SavedModelBuilder and makes it clear that elapsed time matters to you. Alternatively, an MCVE that reproduces the timing issue would allow others to collaborate with you on profiling, diagnosing, and fixing any perceived performance issue.
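
For reference, a minimal timing harness along those lines could look like the sketch below. It assumes the TF 1.x API, that the model has already been exported in both formats, and uses made-up paths:

import time
import tensorflow as tf

saver_path = '/tmp/my_model'              # hypothetical tf.train.Saver checkpoint prefix
saved_model_dir = '/tmp/my_saved_model'   # hypothetical SavedModel export directory

def time_saver_restore():
    # Import the meta-graph and restore variables into a fresh graph/session.
    with tf.Graph().as_default(), tf.Session() as sess:
        start = time.time()
        saver = tf.train.import_meta_graph(saver_path + '.meta')
        saver.restore(sess, saver_path)
        return time.time() - start

def time_saved_model_load():
    # Load the SavedModel tagged for training into a fresh graph/session.
    with tf.Graph().as_default(), tf.Session() as sess:
        start = time.time()
        tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.TRAINING], saved_model_dir)
        return time.time() - start

print('tf.train.Saver restore took %.1f s' % time_saver_restore())
print('SavedModel load took        %.1f s' % time_saved_model_load())

Posting numbers from something like this, along with the model that produces them, would give others a concrete starting point for profiling.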

J_H