
I went through the source for the built-in CIFAR-100 dataset and decided to create a compatible version of the FairFace dataset: once FairFace is converted into a structure very similar to CIFAR-100, I can reuse the other built-in functions without modifications everywhere.

I searched around but was unable to find how the CIFAR-100 SQLite database was created, specifically how the images were converted into BLOBs for storage. After a bit of trial and error, I ended up doing it this way:

# getDatabyIndex returns the (image, label) pair at the given index;
# bytes_feature and int64_feature are my thin tf.train.Feature wrappers (see below).
sample = getDatabyIndex(train_labels, index)
example = tf.train.Example(features=tf.train.Features(feature={
  'image': bytes_feature(sample[0].tobytes()),
  'label': int64_feature(sample[1])
}))
example = example.SerializeToString()
cur.execute(
  "insert into examples ('split_name', 'client_id', 'serialized_example_proto') values (?, ?, ?)",
  ('train', i, sqlite3.Binary(example)))

I execute this for each sample in the train data, and similarly for the test data. I am able to load the result back using this decoding function:

import collections
import tensorflow as tf

def parse_proto(tensor_proto):
  parse_spec = {
    'image': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
    'label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
  }
  decoded_example = tf.io.parse_example(tensor_proto, parse_spec)
  # The image was stored as raw bytes, so decode_raw + reshape recovers
  # the 224x224x3 uint8 array.
  return collections.OrderedDict(
      image=tf.reshape(
          tf.io.decode_raw(decoded_example['image'], tf.uint8), (224, 224, 3)),
      label=decoded_example['label'])

What I noticed, however, is that the final sqlite.lzma compressed archive is 6.4 GB, whereas the source archive for the dataset was 555 MB. I am guessing that the way I am storing the images prevents compression from working as well as it could if they were stored in a more compatible manner. I see from the CIFAR-100 code that the images are loaded directly as FixedLenFeatures of shape (32,32,3), which means they must have been stored that way, but I have not found a way to store my images like that. The only method that worked for me was the bytes_feature route.
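
For reference, this is the parse spec line I mean from the CIFAR-100 loader (quoting tensorflow_federated/python/simulation/datasets/cifar100.py, as far as I can tell):

'image': tf.io.FixedLenFeature(shape=(32, 32, 3), dtype=tf.int64),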

What would be the best/recommended way to go about this?

  • Do I follow the question correctly: the size of the SQLite database _before_ compression is 555 MiB and after compression is 6.5 GiB (compression made the file _much_ larger)? Could the question be expanded to include how LZMA compression is applied to the SQLite database file? – Zachary Garrett Dec 29 '21 at 22:57

1 Answer


Without more information about how the LZMA compression is being applied, it's hard to say why the size increased.

To directly use the same tf.io.FixedLenFeature as the CIFAR-100 dataset from tff.simulation.datasets.cifar100.load_data, the tf.train.Example needs to be constructed using int64_feature() for the 'image' key instead of a bytes feature. This may require casting sample[0] to a different dtype (assuming it is a np.ndarray).
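
A minimal sketch of that construction, assuming sample[0] is a uint8 np.ndarray of shape (224, 224, 3) and sample[1] an integer label (as implied by the question's decoding code):

import tensorflow as tf

image, label = sample[0], sample[1]  # assumed: uint8 (224, 224, 3) array, int label
example = tf.train.Example(features=tf.train.Features(feature={
    # One int64 per pixel value; the (224, 224, 3) shape is recovered at
    # parse time by tf.io.FixedLenFeature(shape=(224, 224, 3), dtype=tf.int64).
    'image': tf.train.Feature(
        int64_list=tf.train.Int64List(value=image.flatten().tolist())),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
}))
serialized = example.SerializeToString()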

During decoding:

  1. First parse the 'image' feature as an (N, M, 3) tensor of int64 values. From tensorflow_federated/python/simulation/datasets/cifar100.py#L31:

    'image': tf.io.FixedLenFeature(shape=(32, 32, 3), dtype=tf.int64),
    
  2. Then cast to tf.uint8 (a combined sketch for the question's image shape follows this list). From tensorflow_federated/python/simulation/datasets/cifar100.py#L37:

    image=tf.cast(parsed_features['image'], tf.uint8),
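
Putting the two steps together for the images in the question (assuming 224x224x3 uint8, as implied by the question's reshape), the parse function might look roughly like this:

import collections
import tensorflow as tf

def parse_proto(tensor_proto):
  parse_spec = {
      # Shape and dtype must match how the int64 'image' feature was written.
      'image': tf.io.FixedLenFeature(shape=(224, 224, 3), dtype=tf.int64),
      'label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
  }
  parsed_features = tf.io.parse_example(tensor_proto, parse_spec)
  return collections.OrderedDict(
      # Cast back down to uint8 after parsing, mirroring the CIFAR-100 loader.
      image=tf.cast(parsed_features['image'], tf.uint8),
      label=parsed_features['label'])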
    

NOTE: Because of the varint encoding used in protocol buffers (https://developers.google.com/protocol-buffers/docs/encoding#varints), using int64 isn't expected to add significant overhead to the serialized representation (in any case, less than 4x).
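
A quick way to sanity-check that claim (a sketch comparing the two storage schemes on a single random image; the variable names are just for illustration):

import numpy as np
import tensorflow as tf

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

as_bytes = tf.train.Example(features=tf.train.Features(feature={
    'image': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[image.tobytes()])),
}))
as_int64 = tf.train.Example(features=tf.train.Features(feature={
    'image': tf.train.Feature(
        int64_list=tf.train.Int64List(value=image.flatten().tolist())),
}))

# Varints store values < 128 in one byte and 128..255 in two bytes, so the
# int64 version is far smaller than a naive 8 bytes per pixel would suggest.
print(len(as_bytes.SerializeToString()), len(as_int64.SerializeToString()))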

Zachary Garrett