
I am trying to create my own training data for the TextSum model. As I understand it, I need to put my articles and abstracts into a binary file (in TFRecords format). However, I cannot create my own training data from raw text files. I don't understand the format very clearly, so I am trying to create a very simple binary file using the following code:

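import os
import tensorflow as tf

# `path` is the directory that holds the raw text files.
files = os.listdir(path)
writer = tf.python_io.TFRecordWriter("test_data")
for file in files:
    # Read as bytes: BytesList expects byte strings.
    with open(os.path.join(path, file), "rb") as f:
        content = f.read()
    example = tf.train.Example(
        features = tf.train.Features(
            feature = {
                'content': tf.train.Feature(bytes_list=tf.train.BytesList(value=[content]))
            }
        )
    )

    serialized = example.SerializeToString()
    writer.write(serialized)

# Close the writer so that all records are flushed to disk.
writer.close()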

I then try to read the values back from this test_data file with the following code:

reader = open("test_data", 'rb')
len_bytes = reader.read(8)
str_len = struct.unpack('q', len_bytes)[0]
example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
example_pb2.Example.FromString(example_str)

But I always get the following error:

  File "dailymail_corpus_to_tfrecords.py", line 34, in check_file
    example_pb2.Example.FromString(example_str)
  File "/home/s1510032/anaconda2/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 770, in FromString
    message.MergeFromString(s)
  File "/home/s1510032/anaconda2/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 1091, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/home/s1510032/anaconda2/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 1117, in InternalParse
    new_pos = local_SkipField(buffer, new_pos, end, tag_bytes)
  File "/home/s1510032/anaconda2/lib/python2.7/site-packages/google/protobuf/internal/decoder.py", line 850, in SkipField
    return WIRETYPE_TO_SKIPPER[wire_type](buffer, pos, end)
  File "/home/s1510032/anaconda2/lib/python2.7/site-packages/google/protobuf/internal/decoder.py", line 791, in _SkipLengthDelimited
    raise _DecodeError('Truncated message.')
google.protobuf.message.DecodeError: Truncated message.

I have no idea what is wrong. Please let me know if you have any suggestions to solve this issue.

mrry
The Lazy Log

2 Answers


For those who have the same issue: I had to look at the TensorFlow source code to see how TFRecordWriter writes out the data. It writes 8 bytes for the length followed by 4 bytes for a CRC check, so the first 12 bytes of each record are a header. The sample binary file in the TextSum code seems to have only an 8-byte header, which is why that code uses reader.read(8) to get the length of the data and reads the rest as features.

My working solution is:

import struct
from tensorflow.core.example import example_pb2

reader = open("test_data", 'rb')
len_bytes = reader.read(8)  # 8-byte record length
reader.read(4)  # ignore the next 4 bytes (CRC of the length)
str_len = struct.unpack('q', len_bytes)[0]
example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
example_pb2.Example.FromString(example_str)
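
The snippet above reads only the first record. As a minimal sketch for reading every record, assuming the TFRecord layout is an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and then a 4-byte CRC of the payload, something like the following could work. The function name `read_tfrecord_payloads` is made up for illustration, and the CRCs are skipped rather than verified:

```python
import struct


def read_tfrecord_payloads(path):
    """Yield the raw payload bytes of each record in a TFRecord-style file.

    Assumed layout per record: 8-byte length, 4-byte CRC of the length,
    payload, 4-byte CRC of the payload. CRCs are skipped, not checked.
    """
    with open(path, "rb") as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # clean end of file
            if len(len_bytes) < 8:
                raise ValueError("truncated length header")
            # The length is an unsigned 64-bit little-endian integer.
            (length,) = struct.unpack("<Q", len_bytes)
            reader.read(4)  # skip the CRC of the length
            payload = reader.read(length)
            if len(payload) < length:
                raise ValueError("truncated record")
            reader.read(4)  # skip the CRC of the payload
            yield payload
```

Each yielded payload should then be parseable with `example_pb2.Example.FromString(payload)`, as in the single-record version.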
The Lazy Log
  • What operating system did you get this running on? On OSX 11 I'm running into issues before I even get here. I had to modify the `Train` method so the supervisor would wait for threads to stop. – Jordan Aug 29 '16 at 13:54
  • I am running OSX. I also tried it on Unix and it ran fine. However, there is a small mistake in my solution: in `data.py` we should use `reader.read(4)` to skip the 4 bytes instead of `seek(12)`. I'm going to update my post. – The Lazy Log Aug 30 '16 at 00:22

I hope you have data_convert_example.py in your textsum directory. If not, you can find it in this post: https://github.com/tensorflow/models/pull/379/files

Use that Python file to convert the given binary toy data (file name: data, in the data directory) into text format:

python data_convert_example.py --command binary_to_text --in_file ../data/data --out_file ../data/result_text

You can see the actual text format you should provide in the result_text file.

Prepare your data in that format, run the same Python script with the text_to_binary command, and use the result for training/testing/eval.
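
For a rough idea of what the resulting binary file looks like, here is a minimal sketch of the framing only. It assumes, as the other answer describes, that TextSum's reader expects just an 8-byte length before each serialized Example, with no CRC fields. The helper names `write_length_prefixed` and `read_length_prefixed` are hypothetical, and the payloads stand in for the output of `example.SerializeToString()`:

```python
import struct


def write_length_prefixed(path, payloads):
    """Write each payload as an 8-byte length followed by the raw bytes."""
    with open(path, "wb") as writer:
        for payload in payloads:
            # 'q' is a native 8-byte signed integer, matching read(8) + 'q'.
            writer.write(struct.pack("q", len(payload)))
            writer.write(payload)


def read_length_prefixed(path):
    """Yield payloads back from a file written by write_length_prefixed."""
    with open(path, "rb") as reader:
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:
                break  # end of file
            (length,) = struct.unpack("q", len_bytes)
            yield reader.read(length)
```

A file written this way has no CRC bytes to skip, which is the difference from a file produced by TFRecordWriter.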

TUMU. S