
I am implementing a Handwritten Text Recognition system in TensorFlow using the Keras interface. In the prediction phase, my system takes as input the image of a line (in .jpg format) and returns the corresponding transcription as output. So this is not a classification problem: it is closer to encoding-decoding, where the input is an image and the output is text.

To sum up, my training samples are pairs of

  • x: an image (read with cv2 from a .jpg file)
  • y: a transcription (read from a .txt file)

By the way, this is how they are stored in the filesystem:

Just plain old files in folders

In almost all the tutorials I have found, for the sake of simplicity all the training samples are loaded into memory at once at the beginning of the training process and then scanned with generator methods. With large datasets, however, loading all the training data into memory may cause an out-of-memory (OOM) error before training even begins.

To avoid this risk I would like to load batches into memory on the fly, so I am wondering whether TF or Keras provides any built-in solution that loads the training data progressively.
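A minimal sketch of that idea, assuming hypothetical `load_x`/`load_y` helpers standing in for `cv2.imread` and a .txt reader: keep only the file paths in memory, and materialize each batch when it is actually requested.

```python
def batch_generator(sample_paths, batch_size, load_x, load_y):
    """Yield (x_batch, y_batch) pairs, reading files only when each
    batch is requested, so memory holds one batch at a time."""
    for start in range(0, len(sample_paths), batch_size):
        chunk = sample_paths[start:start + batch_size]
        x_batch = [load_x(img_path) for img_path, _ in chunk]
        y_batch = [load_y(txt_path) for _, txt_path in chunk]
        yield x_batch, y_batch

# Hypothetical loaders: in the real pipeline load_x would be cv2.imread
# and load_y would read the transcription from the .txt file.
paths = [(f"line_{i}.jpg", f"line_{i}.txt") for i in range(5)]
gen = batch_generator(paths, batch_size=2,
                      load_x=lambda p: p,   # stand-in for cv2.imread(p)
                      load_y=lambda p: p)   # stand-in for open(p).read()
batches = list(gen)
# 5 samples with batch_size 2 give batches of sizes 2, 2, 1
```

A generator like this can be passed directly to `model.fit`, which pulls one batch at a time instead of requiring the whole dataset in memory.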

I have found that the Keras method tf.keras.preprocessing.image_dataset_from_directory does something pretty close to what I have in mind, but apparently it only works for classification tasks: it requires your directory to have a sub-folder for each class, and in my scenario I do not have any classes.

I believe I could implement my own solution by zipping an ImageDataGenerator and a... TextDataGenerator-something (not a real class; I would have to implement it myself). I found an example in the documentation that zips and scans two ImageDataGenerators.
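The zipping idea can be sketched with plain Python generators (both generators below are stand-ins: in the real pipeline the first would be an ImageDataGenerator's flow method over the images and the second the hypothetical custom generator over the .txt files):

```python
def image_batch_gen(image_paths, batch_size):
    # Stand-in for an ImageDataGenerator: each yielded element would
    # really be a batch of decoded images, not a list of paths.
    for i in range(0, len(image_paths), batch_size):
        yield image_paths[i:i + batch_size]

def text_batch_gen(txt_paths, batch_size):
    # Stand-in for the hypothetical "TextDataGenerator": each element
    # would really be a batch of transcriptions read from .txt files.
    for i in range(0, len(txt_paths), batch_size):
        yield txt_paths[i:i + batch_size]

images = [f"line_{i}.jpg" for i in range(4)]
texts = [f"line_{i}.txt" for i in range(4)]

# zip pairs the two lazy streams batch-by-batch, so nothing is loaded
# until the training loop pulls the next (x, y) couple.
pairs = list(zip(image_batch_gen(images, 2), text_batch_gen(texts, 2)))
```

The one thing to be careful about with this pattern is ordering: both generators must iterate the files in the same order (and, if shuffling, shuffle with the same seed), otherwise images and transcriptions fall out of alignment.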

However, my scenario should be quite common, so I find it hard to believe that it is not already handled by some pre-existing TF/Keras class or method. If possible, I would rather not re-invent the wheel; do you have any suggestions?

Thank you for your help. I really appreciate it!

Andrea Rossi
  • You first need to decide whether this is classification or regression: your output is discrete, so it is classification, though not just multi-class classification. Have you considered how exactly a neural network would predict text by directly looking at an image? Most models do this character by character, not in one shot as you imply. – Dr. Snoopy Feb 11 '22 at 23:07
  • My output is discrete (it is a string), and my characters are indeed predicted one by one. My model encodes my images with a series of conv layers and decodes them with LSTM layers and a final dense layer. However, since my samples are images of lines, they do not have pre-defined classes in {'a', 'b', 'c'...}: the expected Y for an input X can be "Lorem Ipsum Dolor Sit Amet". So I don't think I can use tf.keras.preprocessing.image_dataset_from_directory, which seems to require a filesystem structure based on classes (e.g., /data/train/a, /data/train/b, /data/train/c for classes a, b, c). – Andrea Rossi Feb 12 '22 at 10:39

0 Answers