We are using the TensorFlow C API, version 1.13.1, from https://www.tensorflow.org/install/lang_c
Our neural network model (frozen_graph.pb, MobileNet architecture) is 230 MB. When we load it, TensorFlow allocates about 1.1 GB on the first session run; the allocation then drops to ~900 MB and stays at that level.
We tried the graph transform tool from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md to optimize the graph. Only quantization was effective at reducing both the model size and the memory usage, but unfortunately we cannot use it, because quantization reduces model accuracy by 15%.
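For reference, this is roughly how we invoke the transform tool (a sketch; the input/output node names below are placeholders for illustration, not our actual node names, and the exact transform list varies per experiment):

```shell
# Hypothetical transform_graph invocation; substitute your model's
# real input/output node names for the placeholders below.
bazel run tensorflow/tools/graph_transforms:transform_graph -- \
  --in_graph=frozen_graph.pb \
  --out_graph=optimized_graph.pb \
  --inputs='input' \
  --outputs='MobilenetV1/Predictions/Reshape_1' \
  --transforms='
    strip_unused_nodes
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms'
```

Adding `quantize_weights` to the `--transforms` list is what shrinks the model, but that is the step that costs us the 15% accuracy.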
As we currently see it, the only ways to reduce the model size without drastically affecting accuracy are:
1) Move to another backend such as MXNet.
2) Use the knowledge distillation technique to retrain smaller models: https://arxiv.org/pdf/1503.02531.pdf
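For option 2, the core of the distillation recipe from the linked paper is a loss that mixes a soft cross-entropy against the teacher's temperature-softened outputs with the usual hard-label cross-entropy. A minimal NumPy sketch of that loss (the temperature and mixing weight `alpha` are illustrative values, not tuned for our model):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T gives softer targets."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of soft-target and hard-label cross-entropy,
    in the style of Hinton et al. 2015."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Soft loss: cross-entropy against the teacher's softened distribution,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft_loss = -(soft_teacher * np.log(soft_student + 1e-12)).sum(axis=-1).mean()
    soft_loss *= temperature ** 2
    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    probs = softmax(student_logits)
    hard_loss = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

A student trained against this loss can be made much smaller than the 230 MB teacher while keeping most of its accuracy, which would in turn shrink the runtime allocation.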
We expect the memory allocation for a single model not to exceed 150% of the binary size. Any solution is acceptable. Thank you.