I am working on a project that uses audio data. The first step is to run an existing model that produces features of roughly shape [400 x 10_000] for each wav file. Each wav file also has a label that I'm trying to predict. I will then build a second model on top of these features to produce my final result.
I don't want to run the preprocessing every time I train the second model. My plan is a preprocessing pipeline that runs the feature extraction model once and saves its output to a new folder, so that the second model can read the saved features directly. I was looking at using TFRecords, but the documentation is quite unhelpful.
This is what I've come up with to test it so far:
import tensorflow as tf

# `features` is the [400 x 10_000] tensor produced by the feature extraction model
serialized_features = tf.io.serialize_tensor(features)
feature_of_bytes = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized_features.numpy()]))
features_for_example = {
    'feature0': feature_of_bytes
}
example_proto = tf.train.Example(
    features=tf.train.Features(feature=features_for_example))

# Write the single example to a TFRecord file
filename = 'test.tfrecord'
writer = tf.io.TFRecordWriter(filename)
writer.write(example_proto.SerializeToString())

# Read it back to check the round trip
filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
But I'm getting this error:
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 0' failed with Read less bytes than requested
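One likely cause of this error is that the writer is never closed, so the record may still be sitting in a buffer when the dataset reads the file. Below is a minimal sketch of the same round trip with the writer opened in a `with` block, which closes it before reading. The small random `features` tensor, the temporary directory, and the helper names `serialize_example` and `parse_example` are assumptions made for illustration, not part of the original pipeline.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Stand-in for the real feature matrix (assumption: float32, much smaller
# than the actual [400 x 10_000] features).
features = tf.constant(np.random.rand(4, 10).astype(np.float32))


def serialize_example(tensor):
    """Wrap a serialized tensor in a tf.train.Example under the key 'feature0'."""
    serialized = tf.io.serialize_tensor(tensor)
    feature = tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[serialized.numpy()]))
    example = tf.train.Example(
        features=tf.train.Features(feature={'feature0': feature}))
    return example.SerializeToString()


def parse_example(raw_record):
    """Recover the tensor from one serialized tf.train.Example."""
    parsed = tf.io.parse_single_example(
        raw_record, {'feature0': tf.io.FixedLenFeature([], tf.string)})
    return tf.io.parse_tensor(parsed['feature0'], out_type=tf.float32)


# Hypothetical output location; a temporary directory keeps the sketch tidy.
filename = os.path.join(tempfile.mkdtemp(), 'test.tfrecord')

# Leaving the `with` block closes the writer, which flushes the record to disk.
with tf.io.TFRecordWriter(filename) as writer:
    writer.write(serialize_example(features))

dataset = tf.data.TFRecordDataset([filename]).map(parse_example)
restored = next(iter(dataset))
```

Calling `writer.close()` explicitly after the last `write` should have the same effect. The `out_type` passed to `tf.io.parse_tensor` has to match the dtype of the tensor that was serialized.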
tl;dr:
I'm getting the above error with TFRecords. Any recommendations for getting this example working, or for another solution that doesn't use TFRecords?
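For comparison, here is a minimal sketch of one non-TFRecord option: saving each file's features and label as a NumPy `.npz` archive and loading them back later. The output folder, the file naming, and the helper names `save_features` and `load_features` are hypothetical, and the small random array stands in for the real features.

```python
import os
import tempfile

import numpy as np

# Hypothetical output folder for the precomputed features.
out_dir = tempfile.mkdtemp()


def save_features(name, features, label):
    """Store one example's features and label in a single .npz file."""
    path = os.path.join(out_dir, name + '.npz')
    np.savez(path, features=features, label=label)
    return path


def load_features(path):
    """Load the features array and integer label written by save_features."""
    with np.load(path) as data:
        return data['features'], int(data['label'])


# Stand-in for the real [400 x 10_000] feature matrix.
features = np.random.rand(4, 10).astype(np.float32)
path = save_features('example0', features, label=1)
loaded_features, loaded_label = load_features(path)
```

This keeps one file per wav file, which is simple to inspect, though it may be slower to stream than a few large TFRecord files.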