
I am trying to modify a TensorFlow project so that it becomes compatible with TPUs.

For this, I started with the code explained on this site.

Here the COCO dataset is downloaded and its image features are extracted using the InceptionV3 model. I wanted to modify this code so that it supports TPUs.

For this, I added the mandatory TPU setup code as per this link.

Within the TPU strategy scope, I created the InceptionV3 model using the Keras library and loaded it with ImageNet weights, as per the existing code.
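For reference, a minimal sketch of this setup, assuming a Colab-style TPU runtime (on older TF 2.x versions the strategy class is tf.distribute.experimental.TPUStrategy instead):

    import tensorflow as tf

    # Connect to the TPU and create the distribution strategy.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    with strategy.scope():
        # InceptionV3 as a feature extractor: ImageNet weights, no classification head.
        image_model = tf.keras.applications.InceptionV3(
            include_top=False, weights='imagenet')
        image_features_extract_model = tf.keras.Model(
            image_model.input, image_model.layers[-1].output)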

Now, since the TPU needs its data to be stored on Google Cloud Storage, I created a TFRecord file using tf.Example with the help of this link.
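A sketch of how such a TFRecord file could be written; the bucket path, the feature names, and the image_paths list are placeholders, and the image bytes are stored still JPEG-encoded rather than as decoded pixels:

    import tensorflow as tf

    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    def serialize_example(image_path):
        # Read the file as-is: the bytes stay JPEG-compressed inside the record.
        image_bytes = tf.io.read_file(image_path).numpy()
        feature = {
            'image_path': _bytes_feature(image_path.encode('utf-8')),
            'image_raw': _bytes_feature(image_bytes),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        return example.SerializeToString()

    with tf.io.TFRecordWriter('gs://my-bucket/coco_train.tfrecord') as writer:
        for path in image_paths:  # local paths of the downloaded COCO images
            writer.write(serialize_example(path))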

I then tried to create this file in several ways so that it would contain the data the TPU can read through a TFRecordDataset.

At first I directly added the image data and the image path to the file and uploaded it to a GCP bucket, but while reading this data back I realized the image data was not useful: it did not contain the shape/size information that would be needed, and I had not resized the images to the required dimensions before storage. That file came to 2.5 GB, which was okay. Then I thought I would keep only the image paths in the cloud, so I created another TFRecord file with only image paths. But that did not seem optimal either, since the TPU would have to open each image individually, resize it to 299x299, and then feed it to the model; it would be better to obtain the image data through the .map() function of the TFRecordDataset. So I tried again, this time using this link, by storing the R, G, and B values along with the image path inside the TFRecord file.
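For the .map() route, one way to keep the file small is to leave the bytes JPEG-encoded on disk (as in the write-side sketch above) and do the decode/resize inside the map function; a sketch of the read side, reusing the assumed feature names and bucket path from above:

    import tensorflow as tf

    feature_description = {
        'image_path': tf.io.FixedLenFeature([], tf.string),
        'image_raw': tf.io.FixedLenFeature([], tf.string),
    }

    def parse_and_preprocess(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_description)
        image = tf.io.decode_jpeg(parsed['image_raw'], channels=3)
        image = tf.image.resize(image, (299, 299))  # InceptionV3 input size
        image = tf.keras.applications.inception_v3.preprocess_input(image)
        return image, parsed['image_path']

    dataset = (tf.data.TFRecordDataset('gs://my-bucket/coco_train.tfrecord')
               .map(parse_and_preprocess,
                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
               .batch(64)
               .prefetch(tf.data.experimental.AUTOTUNE))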

However, now I see that the TFRecord file is abnormally large, around 40-45 GB, and I ultimately stopped the execution because memory was filling up on the Google Colab TPU runtime.

The original COCO dataset is not that large, roughly 13 GB, and the dataset here is built from only the first 30,000 records, so 40 GB looks like a strange number.

May I know what the problem is with this way of feature storage? Is there a better way to store image data in a TFRecord file and then extract it through a TFRecordDataset?

Pallavi

1 Answer


I think the COCO dataset processed as TFRecords should be around 24-25 GB on GCS. Note that TFRecords aren't meant to act as a form of compression; they represent data as protobufs so they can be optimally loaded into TensorFlow programs.
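As a rough, purely illustrative comparison (assumed numbers: ~640x480 RGB COCO images and ~160 KB average JPEG size), decoded per-pixel values take several times more space than the encoded JPEG bytes, before even counting the protobuf overhead of storing each pixel in an Int64List or FloatList:

    # Back-of-the-envelope sizes for 30,000 images; the exact figures are assumptions.
    n_images = 30_000
    h, w, c = 480, 640, 3

    decoded_bytes_per_image = h * w * c   # raw uint8 pixels: ~0.9 MB per image
    jpeg_bytes_per_image = 160 * 1024     # typical compressed JPEG size

    print(f"decoded pixels: ~{n_images * decoded_bytes_per_image / 1e9:.1f} GB")  # ~27.6 GB
    print(f"encoded JPEGs:  ~{n_images * jpeg_bytes_per_image / 1e9:.1f} GB")     # ~4.9 GB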

You might have more success if you refer to: https://cloud.google.com/tpu/docs/coco-setup (corresponding script can be found here) for converting COCO (or a subset) into TFRecords.

Furthermore, we have implemented detection models for COCO using TF2/Keras, optimized for GPU/TPU, here, which you might find useful for optimal input pipelines. An example tutorial can be found here. Thanks!

Allen Wang
  • That's great, thanks a lot for the answer. Since I didn't get any reply, I went ahead with loading the images onto GCP and then reading them while extracting features, which worked. But I would still prefer the image data to be stored directly in the cloud rather than the images themselves. I will definitely go through the links once I finish off my remaining work on the model. – Pallavi Aug 04 '20 at 06:51
  • Glad that's working out for you! For clarification, does this mean your input pipeline looks like this: raw image -> load into TF -> pre-process? If so, you will probably run into a bottleneck where the preprocessing takes longer than TPU training for each batch. If you feel that TPU training is slower than expected, the above links might help! – Allen Wang Aug 04 '20 at 17:29
  • Actually, I am not doing training at this level. I just needed image features to be extracted from the Inception model for 30k images, and it was taking almost 3-4 hrs without the TPU. Now I have completed it within a few minutes with the TPU. The next piece of code is related to the actual training part, using these features and a few more inputs. – Pallavi Aug 05 '20 at 05:52