I am trying to use a quantized model with TensorFlow Lite Micro, and I get a segmentation fault inside the interpreter->Invoke() call.
The debugger showed that the fault occurs on returning from Eval() in conv.cc at Node 28 (CONV_2D), with the stack corrupted. Built with the compiler flags "-fstack-protector-all -Wstack-protector", the error message is:
*** stack smashing detected ***: <unknown> terminated
My test is simply the person detection example with the model replaced by Mobilenet_V1_0.25_224_quant from the TensorFlow Lite pre-trained models site, kTensorArenaSize increased sufficiently, the model input/output sizes changed to 224x224x3 and 1x1001, and the additional required operators pulled in.
I also tried a few different models: another quantized model, Mobilenet_V1_0.25_192_quant, shows the same segfault, but the regular floating-point models Mobilenet_V1_0.25_192 and Mobilenet_V1_0.25_224 run fine over many loops.
Has anyone seen a similar problem? Or are there limitations in TensorFlow Lite Micro that I should be aware of?
This problem can be reproduced at this commit of my forked tensorflow repo.
Build command:
$ bazel build //tensorflow/lite/micro/examples/person_detection:person_detection -c dbg --copt=-fstack-protector-all --copt=-Wstack-protector --copt=-fno-omit-frame-pointer
And run:
$ ./bazel-bin/tensorflow/lite/micro/examples/person_detection/person_detection
Files changed:
tensorflow/lite/micro/examples/person_detection/main_functions.cc
tensorflow/lite/micro/examples/person_detection/model_settings.h
tensorflow/lite/micro/examples/person_detection/person_detect_model_data.cc
Changes in main_functions.cc:
constexpr int kTensorArenaSize = 1400 * 1024;
static tflite::MicroOpResolver<5> micro_op_resolver;
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_RESHAPE,
                             tflite::ops::micro::Register_RESHAPE());
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                             tflite::ops::micro::Register_SOFTMAX(), 1, 2);
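For context, the resolver keeps the three operators that the stock person_detection example already registers; a sketch of those registrations (same pattern as above; whether version arguments are needed depends on the model, as with SOFTMAX):
// Operators kept from the original example; together with RESHAPE and
// SOFTMAX above this makes the 5 operators that MicroOpResolver<5> holds.
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
                             tflite::ops::micro::Register_DEPTHWISE_CONV_2D());
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_CONV_2D,
                             tflite::ops::micro::Register_CONV_2D());
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_AVERAGE_POOL_2D,
                             tflite::ops::micro::Register_AVERAGE_POOL_2D());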
Changes in model_settings.h:
constexpr int kNumCols = 224;
constexpr int kNumRows = 224;
constexpr int kNumChannels = 3;
constexpr int kCategoryCount = 1001;
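For context, model_settings.h also derives the input buffer size from these constants (as in the stock person_detection example), so it tracks the new dimensions automatically:
// Input image size in bytes for 224x224x3 uint8 data.
constexpr int kMaxImageSize = kNumCols * kNumRows * kNumChannels;  // 150528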
The last changed file, person_detect_model_data.cc, is pretty big; please see the full file on GitHub.
Updated on March 28, 2020: I also tested on a Raspberry Pi 3; the results are the same as on x86 Ubuntu 18.04.
pi@raspberrypi:~/tests $ ./person_detection
*** stack smashing detected ***: <unknown> terminated
Aborted
Thanks for your help.
Problem root cause found - updated on April 2, 2020:
I found that the problem is caused by an array overrun in the per-layer operation data. TensorFlow Lite Micro has a hidden limit (or one I missed in the documentation; at least the runtime does not check it) of at most 256 output channels, hard-coded in the OpData structure of conv.cc:
constexpr int kMaxChannels = 256;
...
struct OpData {
  ...
  // Per channel output multiplier and shift.
  // TODO(b/141139247): Allocate these dynamically when possible.
  int32_t per_channel_output_multiplier[kMaxChannels];
  int32_t per_channel_output_shift[kMaxChannels];
  ...
};
The MobileNet model Mobilenet_V1_0.25_224_quant.tflite has 1000 output classes and a total of 1001 channels internally. This causes stack corruption in tflite::PopulateConvolutionQuantizationParams() (tensorflow/lite/kernels/kernel_util.cc:90) for the last CONV_2D, whose output size is 1001.
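A minimal standalone sketch of the overrun pattern (names are hypothetical stand-ins, not the actual TFLM code; the real loop lives in PopulateConvolutionQuantizationParams and writes one entry per output channel):
#include <cstdint>

constexpr int kMaxChannels = 256;

// Mirrors the OpData arrays above; in the kernel this object sits on the
// stack, so overrunning it clobbers the enclosing stack frame.
struct OpData {
  int32_t per_channel_output_multiplier[kMaxChannels];
  int32_t per_channel_output_shift[kMaxChannels];
};

// Stand-in for the per-channel quantization loop: one write per channel,
// with no bounds check against kMaxChannels.
void PopulatePerChannelParams(OpData* data, int num_channels) {
  for (int i = 0; i < num_channels; ++i) {
    data->per_channel_output_multiplier[i] = 0;  // overruns once i >= 256
    data->per_channel_output_shift[i] = 0;
  }
}

int main() {
  OpData data;
  // 1001 channels, as in the last CONV_2D of this model; with
  // -fstack-protector-all this aborts with "stack smashing detected".
  PopulatePerChannelParams(&data, 1001);
  return 0;
}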
TF and TF Lite have no problem, as they apparently do not use this structure definition.
Confirmed: after increasing kMaxChannels to 1024, many loops of model evaluation calls run cleanly.
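The local workaround I tested (not a proper fix):
// In the OpData definition in conv.cc: raise the limit above the model's
// widest output (1001 channels here).
constexpr int kMaxChannels = 1024;  // was 256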
Most TF Lite Micro use cases probably involve small models and won't run into this problem. Still, perhaps this limit should be better documented and/or checked at run time?
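For example, a guard along these lines (a sketch only, not the actual TFLM code; the function name and the NHWC channel-index assumption are mine):
#include "tensorflow/lite/c/common.h"

// Could be called from conv.cc's Prepare() before the per-channel arrays
// are populated, failing gracefully instead of corrupting the stack.
// kMaxChannels is the conv.cc constant shown above.
TfLiteStatus CheckConvChannelLimit(TfLiteContext* context,
                                   const TfLiteTensor* output) {
  const int num_channels = output->dims->data[3];  // NHWC: dim 3 = channels
  TF_LITE_ENSURE_MSG(context, num_channels <= kMaxChannels,
                     "Conv output channels exceed kMaxChannels.");
  return kTfLiteOk;
}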