Background
We are relatively new to TensorFlow. We are working on a deep-learning problem involving a video dataset. Due to the volume of data involved, we decided to preprocess the videos and store the frames as JPEGs in TFRecord files. We then plan to use tf.data.TFRecordDataset to feed the data to our model.
The videos have been processed into segments, each segment consisting of 16 frames stored as a serialised tensor. Each frame is a 128x128 RGB image encoded as a JPEG. Each serialised segment is stored, along with some metadata, as a serialised tf.train.Example in the TFRecords.
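For context, the writer side does roughly the following (a simplified sketch consistent with the parser below; the helper names, file names and the segments variable are placeholders, not our actual tf2_preprocessing.py):
import tensorflow as tf

def _encode_segment(frames):
    # frames: a (16, 128, 128, 3) uint8 tensor/array for one segment
    # -> a single bytes object: a serialised tensor of 16 JPEG strings
    jpegs = tf.map_fn(tf.io.encode_jpeg, tf.convert_to_tensor(frames), dtype=tf.string)
    return tf.io.serialize_tensor(jpegs).numpy()

def _make_example(frames, filename, num):
    # pack the serialised segment plus metadata into a tf.train.Example
    feature = {
        'segment': tf.train.Feature(bytes_list=tf.train.BytesList(value=[_encode_segment(frames)])),
        'file': tf.train.Feature(bytes_list=tf.train.BytesList(value=[filename.encode()])),
        'num': tf.train.Feature(int64_list=tf.train.Int64List(value=[num])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# illustrative only: 'segments' would be the 16-frame chunks of one video
with tf.io.TFRecordWriter('some_video.tfrecord') as writer:
    for num, frames in enumerate(segments):
        writer.write(_make_example(frames, 'some_video.mp4', num))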
TensorFlow version: 2.1
Code
Below is the code we are using to create the tf.data.TFRecordDataset from the TFRecords. You can ignore the num and file fields.
import os
import math
import tensorflow as tf

# Corresponding changes are to be made here
# if the feature description in tf2_preprocessing.py
# is changed
feature_description = {
    'segment': tf.io.FixedLenFeature([], tf.string),
    'file': tf.io.FixedLenFeature([], tf.string),
    'num': tf.io.FixedLenFeature([], tf.int64)
}


def build_dataset(dir_path, batch_size=16, file_buffer=500*1024*1024,
                  shuffle_buffer=1024, label=1):
    '''Return a tf.data.Dataset based on all TFRecords in dir_path

    Args:
        dir_path: path to directory containing the TFRecords
        batch_size: size of batch ie #training examples per element of the dataset
        file_buffer: for TFRecords, size in bytes
        shuffle_buffer: #examples to buffer while shuffling
        label: target label for the example
    '''
    # glob pattern for files
    file_pattern = os.path.join(dir_path, '*.tfrecord')
    # stores shuffled filenames
    file_ds = tf.data.Dataset.list_files(file_pattern)
    # read from multiple files in parallel
    ds = tf.data.TFRecordDataset(file_ds,
                                 num_parallel_reads=tf.data.experimental.AUTOTUNE,
                                 buffer_size=file_buffer)
    # randomly draw examples from the shuffle buffer
    ds = ds.shuffle(buffer_size=shuffle_buffer,
                    reshuffle_each_iteration=True)
    # batch the examples
    # dropping remainder for now, trouble when parsing - adding labels
    ds = ds.batch(batch_size, drop_remainder=True)
    # parse the records into the correct types
    ds = ds.map(lambda x: _my_parser(x, label, batch_size),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    return ds


def _my_parser(examples, label, batch_size):
    '''Parses a batch of serialised tf.train.Example(s)

    Args:
        examples: a batch of serialised tf.train.Example(s)

    Returns:
        a tuple (segment, label)
        where segment is a tensor of shape (#in_batch, #frames, h, w, #channels)
    '''
    # ex will be a tensor of serialised tensors
    ex = tf.io.parse_example(examples, features=feature_description)
    ex['segment'] = tf.map_fn(lambda x: _parse_segment(x),
                              ex['segment'], dtype=tf.uint8)
    # ignoring filename and segment num for now
    # returns a tuple (tensor1, tensor2)
    # tensor1 is a batch of segments, tensor2 is the corresponding labels
    return (ex['segment'], tf.fill((batch_size, 1), label))


def _parse_segment(segment):
    '''Parses a segment and returns it as a tensor

    A segment is a serialised tensor of a number of encoded jpegs
    '''
    # now a tensor of encoded jpegs
    parsed = tf.io.parse_tensor(segment, out_type=tf.string)
    # now a tensor of shape (#frames, h, w, #channels)
    parsed = tf.map_fn(lambda y: tf.io.decode_jpeg(y), parsed, dtype=tf.uint8)
    return parsed
Problem
While training, our model crashed because it ran out of RAM. We investigated by running some tests and profiling the memory with memory-profiler (with the --include-children flag).
All these tests were run (CPU only) by simply iterating over the dataset multiple times with the following code:
count = 0
dir_path = 'some/path'
ds = build_dataset(dir_path, file_buffer=some_value)
for itr in range(100):
    print(itr)
    for itx in ds:
        count += 1
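For reference, an in-process equivalent of that measurement (a sketch using memory_profiler.memory_usage with include_children=True; the path, file_buffer value and iteration count are placeholders):
from memory_profiler import memory_usage

def iterate_dataset(dir_path, file_buffer, iterations=100):
    # same loop as above, wrapped in a function so memory_usage can run it
    count = 0
    ds = build_dataset(dir_path, file_buffer=file_buffer)
    for itr in range(iterations):
        print(itr)
        for itx in ds:
            count += 1
    return count

# sample RSS (including child processes) every 0.5 s while the loop runs
mem = memory_usage((iterate_dataset, ('some/path', 500*1024*1024)),
                   include_children=True, interval=0.5)
print('peak memory: %.1f MiB' % max(mem))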
The total size of the subset of TFRecords we are working with now is ~3 GB. We would prefer to use TF2.1, but we can test with TF2.2 as well.
According to the TF2 docs, file_buffer (the buffer_size argument of TFRecordDataset) is in bytes.
Trial 1: file_buffer = 500*1024*1024, TF2.1
Trial 2: file_buffer = 500*1024*1024, TF2.2
This one seems much better.
Trial 3: file_buffer = 1024*1024, TF2.1. We don't have the plot for this one, but the RAM maxes out at ~4.5 GB.
Trial 4: file_buffer = 1024*1024, TF2.1, but with prefetch set to 10.
I think there is a memory leak here, as we can see the memory usage gradually building up over time.
All trials below were run for only 50 iterations instead of 100.
Trial 5: file_buffer = 500*1024*1024, TF2.1, prefetch = 2, and all other AUTOTUNE values set to 16.
Trial 6: file_buffer = 1024*1024, rest same as above.
Questions
- How does the file_buffer value affect memory usage? Comparing Trial 1 and Trial 3, file_buffer was reduced 500-fold, but memory usage only dropped by about half. Is the file buffer value really in bytes? (A variant we could use to test this is sketched after this list.)
- Trial 6's parameters seemed promising, but training the model with the same settings failed, as it ran out of memory again.
- Is there a bug in TF2.1? What explains the huge difference between Trial 1 and Trial 2?
- Should we continue using AUTOTUNE or revert to constant values?
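Regarding the first question, one variant we could test is pinning num_parallel_reads to a fixed value instead of AUTOTUNE, to see whether the total read-buffer memory scales with the number of files being read in parallel (a sketch reusing the imports and _my_parser helper from the code above; everything else is unchanged):
def build_dataset_fixed_readers(dir_path, batch_size=16, file_buffer=500*1024*1024,
                                shuffle_buffer=1024, label=1, num_readers=1):
    '''Same pipeline as build_dataset, but with a known number of parallel readers,
    so at most num_readers read buffers of file_buffer bytes should be live at once.'''
    file_pattern = os.path.join(dir_path, '*.tfrecord')
    file_ds = tf.data.Dataset.list_files(file_pattern)
    ds = tf.data.TFRecordDataset(file_ds,
                                 num_parallel_reads=num_readers,  # fixed instead of AUTOTUNE
                                 buffer_size=file_buffer)
    ds = ds.shuffle(buffer_size=shuffle_buffer, reshuffle_each_iteration=True)
    ds = ds.batch(batch_size, drop_remainder=True)
    ds = ds.map(lambda x: _my_parser(x, label, batch_size),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.prefetch(2)  # fixed prefetch, as in Trials 5 and 6
    return ds
If peak memory roughly tracks num_readers * file_buffer, that would suggest the buffer is allocated per file being read rather than once for the whole dataset.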
I would be happy to run more tests with different parameters. Thanks in advance!