
I'm building a model to use with tf.slim that will run against the AVA dataset: about 256K JPG images, 32 GB in total. From the full-resolution images, I created 20 sharded TFRecord files for training, each 1.54 GB in size.

During training, my pre-processing step resizes each image to (256, 256, 3) before extracting a random crop of (224, 224, 3). If I instead resize the JPG images before creating the TFRecord files, each shard shrinks to 28 MB.
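
Roughly, the training-time preprocessing I have in mind looks like this (just a sketch in TF 1.x style; the exact ops are illustrative):

```python
import tensorflow as tf

def preprocess_for_training(image_bytes):
    # Decode the JPG stored in the TFRecord example.
    image = tf.image.decode_jpeg(image_bytes, channels=3)
    # Resize to (256, 256, 3), then take a random (224, 224, 3) crop.
    image = tf.image.resize_images(image, [256, 256])
    return tf.random_crop(image, [224, 224, 3])
```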

Aside from the extra time, is there any other problem with my methodology if I resize the JPG files BEFORE creating the TFRecords?

– michael

2 Answers

1

That seems to be a sensible approach in general for a large dataset.

From the TensorFlow docs: https://www.tensorflow.org/performance/performance_guide

Reading large numbers of small files significantly impacts I/O performance. One approach to get maximum I/O throughput is to preprocess input data into larger (~100MB) TFRecord files. For smaller data sets (200MB-1GB), the best approach is often to load the entire data set into memory. The document Downloading and converting to TFRecord format includes information and scripts for creating TFRecords and this script converts the CIFAR-10 data set into TFRecords.

Whether this will improve training performance (as in speed) may depend on your setup, in particular for a local setup with a GPU (see Matan Hugi's answer). I haven't done any performance tests myself.

The preprocessing only needs to happen once, and you could run it in the cloud if necessary. Input processing is more likely to become a bottleneck when your GPU gets faster, e.g. if you run training on Google's ML Engine with a more powerful GPU (unless you already have access to a faster GPU yourself), or when I/O gets slower (e.g. when it involves the network).
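
As a sketch, a one-off conversion pass could look something like this (the paths, shard handling and the "image/encoded" feature key are assumptions, not your actual conversion script):

```python
import tensorflow as tf

def convert_shard(jpeg_paths, output_path):
    """Resize JPGs to 256x256, re-encode them, and write one TFRecord shard."""
    with tf.Graph().as_default(), tf.Session() as sess:
        path_ph = tf.placeholder(tf.string)
        image = tf.image.decode_jpeg(tf.read_file(path_ph), channels=3)
        resized = tf.image.resize_images(image, [256, 256])
        encoded = tf.image.encode_jpeg(tf.cast(resized, tf.uint8))
        with tf.python_io.TFRecordWriter(output_path) as writer:
            for path in jpeg_paths:
                jpeg_bytes = sess.run(encoded, feed_dict={path_ph: path})
                example = tf.train.Example(features=tf.train.Features(feature={
                    "image/encoded": tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
                }))
                writer.write(example.SerializeToString())
```

You would run this once per shard (20 calls in your case) and then train from the small shards.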

In summary some advantages:

  • preprocessing is only done once
  • preprocessing can be run in the cloud
  • reduces a potential I/O bottleneck (if there is one)

You have that additional step though.

In your case, 20 × 28 MB ≈ 560 MB of resized records should easily fit into memory, though.
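
For example, with tf.data you could cache the resized shards in memory after the first pass; the shard pattern and feature key below are assumptions:

```python
import tensorflow as tf

def parse_and_crop(serialized):
    # "image/encoded" is an assumed feature key; adjust it to match how the shards were written.
    features = tf.parse_single_example(
        serialized, {"image/encoded": tf.FixedLenFeature([], tf.string)})
    image = tf.image.decode_jpeg(features["image/encoded"], channels=3)
    # The images were already resized to 256x256 before conversion, so only crop here.
    return tf.random_crop(image, [224, 224, 3])

dataset = (tf.data.TFRecordDataset(tf.gfile.Glob("train-*.tfrecord"))
           .cache()                 # ~560 MB of serialized records fits in RAM
           .shuffle(10000)
           .map(parse_and_crop, num_parallel_calls=4)
           .batch(32)
           .prefetch(1))
```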

– de1
-1

No, there is no problem with this methodology, but performance-wise you will not see any significant improvement from creating the resized TFRecords. Of course, they will consume less disk space.

If I may make a recommendation: if you have a decent storage device (it doesn't have to be an SSD) and you manage your data input pipeline correctly (prefetching enough upcoming samples), TFRecords offer no performance improvement over reading single image files, and single image files cause much less headache and overhead.
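
As a sketch of the kind of pipeline I mean, reading the JPG files directly with prefetching (the glob pattern, crop sizes and buffer sizes are illustrative, not a recipe):

```python
import tensorflow as tf

def load_and_preprocess(path):
    # Decode, resize to 256x256, and take a random 224x224 crop on the fly.
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.image.resize_images(image, [256, 256])
    return tf.random_crop(image, [224, 224, 3])

dataset = (tf.data.Dataset.list_files("ava/images/*.jpg")
           .shuffle(10000)
           .map(load_and_preprocess, num_parallel_calls=8)
           .prefetch(1000)   # keep ~1000 decoded samples ready for the GPU
           .batch(32))
```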

– Matan Hugi
  • I am curious why you think that there are no performance improvements? Are you loading everything into memory? How did you test the performance? – de1 Jan 10 '18 at 09:20
  • I think there is no performance improvement, based on testing it in my own input pipeline. I'm not loading everything into memory, but I do prefetch the next ~1000 samples (in my case this is sufficient; this number is NOT general enough for every purpose). I tested the performance by comparing the number of input samples processed per second. – Matan Hugi Jan 10 '18 at 10:04
  • Why do you think you'll not see a performance improvement? Without pre-resizing, at every batch and every epoch during training, the images will be loaded in at full size and then the computationally expensive operation of resizing will occur. Try timing a single epoch with images pre-resized to 64x64 versus with a dataloader that resizes 1024x1024 images to 64x64. – Austin Apr 03 '20 at 16:50