I went through the source for the built-in CIFAR-100 dataset and decided to create a compatible version of the FairFace dataset: once FairFace is in a structure very similar to CIFAR-100, I should be able to leverage the other built-in functions without many modifications everywhere.
I searched around but was unable to find how the CIFAR-100 SQLite database was created, specifically how the images were converted into BLOBs for storage. After a bit of trial and error, I ended up doing it this way:
# sample[0] is the uint8 image array, sample[1] is the integer label
sample = getDatabyIndex(train_labels, index)
example = tf.train.Example(features=tf.train.Features(feature={
    'image': bytes_feature(sample[0].tobytes()),
    'label': int64_feature(sample[1])
}))
example = example.SerializeToString()
cur.execute(
    "insert into examples('split_name','client_id','serialized_example_proto') values(?,?,?)",
    ('train', i, sqlite3.Binary(example)))
I execute this for each sample in the training data, and similarly for the test data. I am able to load the result using this decoding function:
def parse_proto(tensor_proto):
    parse_spec = {
        'image': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
        'label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
    }
    decoded_example = tf.io.parse_example(tensor_proto, parse_spec)
    return collections.OrderedDict(
        # The image was stored as raw bytes, so decode and reshape it back to HWC.
        image=tf.reshape(
            tf.io.decode_raw(decoded_example['image'], tf.uint8), (224, 224, 3)),
        label=decoded_example['label'])
What I noticed, however, is that the final sqlite.lzma compressed archive is 6.4 GB, whereas the source archive for the dataset was 555 MB. I am guessing that, because of the way I am storing the images, compression is not working as well as it could if they were stored in a more compatible manner. I can see from the CIFAR-100 code that the images are loaded directly as a FixedLenFeature of shape (32, 32, 3), which suggests they were stored that way, but I have been unable to find a way to store my images like that; the only method that worked for me was the bytes_feature route.
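To make concrete what I mean by "stored as such": my best guess is that it would mean writing the pixels out as an int64 list and parsing them back with a shaped FixedLenFeature, roughly as below. This is only my guess at the approach, not something taken from the CIFAR-100 builder, and the (224, 224, 3) shape and int64 dtype are my assumptions:

import numpy as np
import tensorflow as tf

image = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder for a FairFace image
label = 0

# Store every pixel value as an entry in an Int64List instead of one raw-bytes blob.
example = tf.train.Example(features=tf.train.Features(feature={
    'image': tf.train.Feature(
        int64_list=tf.train.Int64List(value=image.flatten().tolist())),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
}))

# Parse it back with an explicit shape, the way the CIFAR-100 loader appears to.
parse_spec = {
    'image': tf.io.FixedLenFeature(shape=(224, 224, 3), dtype=tf.int64),
    'label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
}
decoded = tf.io.parse_single_example(example.SerializeToString(), parse_spec)

But this writes each uint8 pixel as an int64 feature value, and I am not sure this is really how the original database was produced or whether it would compress any better.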
What would be the best/recommended way to go about this?