How to debug invalid UTF-8 in protobuf?

Question

I'm working with some TensorFlow code and trying to load a trained checkpoint, but it fails with a protobuf error like this:

[libprotobuf ERROR google/protobuf/wire_format_lite.cc:577] String field 'tensorflow.TensorShapeProto.Dim.name' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes. 
Traceback (most recent call last):
  [...]
  File "/home/sopi/miniconda3/envs/magenta2/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3053, in _as_graph_def
    graph.ParseFromString(compat.as_bytes(data))
google.protobuf.message.DecodeError: Error parsing message

In order to debug the training code that is apparently producing invalid UTF-8, I'd like to know what the invalid data actually looks like. Stepping through the code in pdb doesn't get me very far, since `ParseFromString()` is implemented in C++.

How can I find out what the invalid UTF-8 data is, or even the position in the byte array at which the error occurred?

(In this case, `graph` is a `tensorflow.core.framework.graph_pb2.GraphDef`, which is a subclass of `google.protobuf.message.Message`. But my question concerns protobuf parsing in general, and I don't think there's anything special about `GraphDef` in this respect.)

comment out the line producing the error and add the following line: `print(compat.as_bytes(data))` — TheEagle, Jan 04 '21 at 15:21
`data` is about 2 GB in size, i'm hoping not to have to go through all of it manually. — ahihi, Jan 04 '21 at 15:23
ok, then try to convert it to bytes with python built-in function `bytes(data)` — TheEagle, Jan 04 '21 at 15:24
TL;DR : please post the WHOLE traceback and not just the last line — TheEagle, Jan 04 '21 at 15:26
i left it out because it's not really relevant. i edited the question a bit to clarify this. — ahihi, Jan 04 '21 at 15:48
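
One possible way to avoid reading 2 GB by hand is to walk the protobuf wire format directly in Python and test only the fields of interest with `bytes.decode`, whose `UnicodeDecodeError` carries the offset of the first bad byte. The following is a rough sketch, not a definitive tool. The helper names (`read_varint`, `iter_fields`, `find_invalid_utf8`) are made up for illustration, and the scanner works without the `.proto` schema, so it has to guess which length-delimited payloads are nested messages. It also assumes you know the field number of the string field named in the error; `name` appears to be field 2 of `TensorShapeProto.Dim` in `tensor_shape.proto`, but check that against your TensorFlow version.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or overlong varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(buf, base=0, path=()):
    """Yield (absolute_offset, field_path, payload) for length-delimited fields.

    Heuristic: every payload that happens to parse as a message is descended
    into, because strings, bytes and sub-messages look alike on the wire.
    """
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 7
        if field_number == 0:
            raise ValueError("field number 0 is not valid")
        if wire_type == 0:        # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            pos += 8
        elif wire_type == 5:      # fixed 32-bit
            pos += 4
        elif wire_type == 2:      # length-delimited
            length, pos = read_varint(buf, pos)
            end = pos + length
            if end > len(buf):
                raise ValueError("length runs past end of buffer")
            payload = buf[pos:end]
            here = path + (field_number,)
            yield base + pos, here, payload
            try:
                nested = list(iter_fields(payload, base + pos, here))
            except ValueError:
                nested = []       # payload is not a message; do not descend
            yield from nested
            pos = end
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
    if pos != len(buf):
        raise ValueError("fixed-width field runs past end of buffer")


def find_invalid_utf8(data, leaf_field_number):
    """Return (offset_of_bad_byte, field_path, payload) for suspect fields."""
    hits = []
    for offset, path, payload in iter_fields(memoryview(data)):
        if path[-1] != leaf_field_number:
            continue
        try:
            bytes(payload).decode("utf-8")
        except UnicodeDecodeError as err:
            hits.append((offset + err.start, path, bytes(payload)))
    return hits


# Hand-built sample: field 1 holds a sub-message whose field 2 appears twice,
# once with valid text and once with a stray 0xFF byte.
sample = b"\x0a\x0c" + b"\x08\x96\x01" + b"\x12\x02ok" + b"\x12\x03a\xffb"
for bad_offset, field_path, raw in find_invalid_utf8(sample, 2):
    print(bad_offset, field_path, raw)
```

For the case in the question this would be called as `find_invalid_utf8(compat.as_bytes(data), 2)`. Each hit gives the absolute byte offset, the chain of field numbers leading to the field, and the raw payload. Because the scan is schema-free, `bytes` fields or packed numeric fields that share the same field number can show up as false positives, and Python's UTF-8 check may not match protobuf's validator in every edge case, so treat the results as candidates to inspect.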
