I've written a TensorFlow program that reads 128x128 images. It runs more or less OK on my laptop, which I use to check that the code works. The first program is based on the MNIST tutorial, and the second uses the MNIST example for a convolutional network. When I try to run them on a GPU, I get the following error message:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16384,20000]
 [[Node: inputLayer_1/weights/Variable/Adam_1/Assign = Assign[T=DT_FLOAT, _class=["loc:@inputLayer_1/weights/Variable"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](inputLayer_1/weights/Variable/Adam_1, inputLayer_1/weights/Variable/Adam_1/Initializer/Const)]]

From what I've been reading online, I have to use batches in my testing. Here is how the feeding works:

...........................................
batchSize  = 40
img_height = 128
img_width  = 128


# 1st function: read images from a TFRecord file
def getImage(filename):
    # convert filenames to a queue for an input pipeline.
    filenameQ = tf.train.string_input_producer([filename],num_epochs=None)

    # object to read records
    recordReader = tf.TFRecordReader()

    # read the full set of features for a single example
    key, fullExample = recordReader.read(filenameQ)

    # parse the full example into its component features.
    features = tf.parse_single_example(
        fullExample,
        features={
            'image/height': tf.FixedLenFeature([], tf.int64),
            'image/width': tf.FixedLenFeature([], tf.int64),
            'image/colorspace': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/channels':  tf.FixedLenFeature([], tf.int64),
            'image/class/label': tf.FixedLenFeature([],tf.int64),
            'image/class/text': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/format': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/filename': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/encoded': tf.FixedLenFeature([], dtype=tf.string, default_value='')
        })

    # now we are going to manipulate the label and image features
    label = features['image/class/label']
    image_buffer = features['image/encoded']
    # Decode the jpeg
    with tf.name_scope('decode_jpeg',[image_buffer], None):
        # decode
        image = tf.image.decode_jpeg(image_buffer, channels=3)

        # and convert to single precision data type
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    # cast image into a single array, where each element corresponds to the greyscale
    # value of a single pixel.
    # the "1-.." part inverts the image, so that the background is black.
    image=tf.reshape(1-tf.image.rgb_to_grayscale(image),[img_height*img_width])
    # re-define label as a "one-hot" vector
    # it will be [0,1] or [1,0] here.
    # This approach can easily be extended to more classes.
    label=tf.stack(tf.one_hot(label-1, numberOFclasses))
    return label, image

train_img,train_label = getImage(TF_Records+"/train-00000-of-00001")
validation_img,validation_label=getImage(TF_Records+"/validation-00000-of-00001")
# associate the "label_batch" and "image_batch" objects with a randomly selected batch---
# of labels and images respectively
train_imageBatch, train_labelBatch = tf.train.shuffle_batch([train_img, train_label], batch_size=batchSize,capacity=50,min_after_dequeue=10)

# and similarly for the validation data
validation_imageBatch, validation_labelBatch = tf.train.shuffle_batch([validation_img, validation_label],
                                                batch_size=batchSize,capacity=50,min_after_dequeue=10)

........................................................

sess.run(tf.global_variables_initializer())

# start the threads used for reading files
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess,coord=coord)

# feeding function
def feed_dict(train):
    if True :
        #img_batch, labels_batch= tf.train.shuffle_batch([train_label,train_img],batch_size=batchSize,capacity=500,min_after_dequeue=200)
        img_batch , labels_batch = sess.run([ train_labelBatch ,train_imageBatch])
        dropoutValue = 0.7
    else:
        #   img_batch,labels_batch = tf.train.shuffle_batch([validation_label,validation_img],batch_size=batchSize,capacity=500,min_after_dequeue=200)
        img_batch,labels_batch = sess.run([ validation_labelBatch,validation_imageBatch])
        dropoutValue = 1
    return {x:img_batch,y_:labels_batch,keep_prob:dropoutValue}

for i  in range(max_numberofiteretion):
    if i%10 == 0:#Run a Test
        summary, acc = sess.run([merged,accuracy],feed_dict=feed_dict(False))
        test_writer.add_summary(summary,i)# Save to TensorBoard
    else: # Training
      if i % 100 == 99:  # Record execution stats
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        summary, _ = sess.run([merged, train_step],
                              feed_dict=feed_dict(True),
                              options=run_options,
                              run_metadata=run_metadata)
        train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
        train_writer.add_summary(summary, i)
        print('Adding run metadata for', i)
      else:  # Record a summary
        summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True))
        train_writer.add_summary(summary, i)

# finalise
coord.request_stop()
coord.join(threads)
train_writer.close()
test_writer.close()

..................................................

The validation folder contained 2100 files, so yes, I understand that's too much.

I found this suggestion:

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
with tf.Session(config = config) as s:......

But this didn't solve the issue. Any idea how I can solve this?

Jonathan
Engine

1 Answer

The problem seems to be that everything in the graph runs on the GPU. Use the CPU for the preprocessing functions and the GPU for the rest of the graph: place the input-processing code, such as getImage() and the queues, on the CPU. While the GPU is working on one batch of tensors, the CPU should be filling the input pipeline queues, so that both are used efficiently. This is explained in the TensorFlow performance guide:

Placing preprocessing on the CPU can result in a 6X+ increase in samples/sec processed, which could lead to training in 1/6th of the time. https://www.tensorflow.org/performance/performance_guide
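To make the idea of "the CPU fills the queue while the GPU trains" concrete, here is a minimal pure-Python analogy. It is only a sketch: it uses the standard `queue` and `threading` modules rather than TensorFlow queues, and the names `producer`, `consumer` and `batch_queue` are made up for illustration.

```python
import queue
import threading


def producer(q, num_batches, batch_size):
    """Stand-in for CPU-side preprocessing: builds batches and enqueues them."""
    for step in range(num_batches):
        # Fake "images": just consecutive integers.
        batch = [step * batch_size + i for i in range(batch_size)]
        q.put(batch)  # blocks while the queue is full, like a bounded input queue
    q.put(None)       # sentinel: no more data


def consumer(q):
    """Stand-in for the training step: dequeues batches and 'trains' on them."""
    seen = 0
    while True:
        batch = q.get()
        if batch is None:
            break
        seen += len(batch)
    return seen


# A bounded queue decouples the two sides: the producer runs ahead by at most
# 4 batches while the consumer is busy.
batch_queue = queue.Queue(maxsize=4)
worker = threading.Thread(target=producer, args=(batch_queue, 10, 40))
worker.start()
total_examples = consumer(batch_queue)
worker.join()
print(total_examples)  # 400
```

In TensorFlow 1.x the queue runners started by `tf.train.start_queue_runners` play the role of the producer thread here.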

For example, you can create a function get_batch that runs on the CPU like this:

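def get_batch(filename):
    # Build the whole input pipeline on the CPU so the GPU only runs the model.
    with tf.device('/cpu:0'):
        # Filename queue, record reading and JPEG decoding (getImage from the question).
        label, image = getImage(filename)
        # Shuffle single examples and group them into batches.
        images, labels = tf.train.shuffle_batch(
            [image, label], batch_size=batchSize,
            capacity=50, min_after_dequeue=10)
    return images, labels

train_imageBatch, train_labelBatch = get_batch(TF_Records + "/train-00000-of-00001")
validation_imageBatch, validation_labelBatch = get_batch(TF_Records + "/validation-00000-of-00001")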

Also see the link below on how to switch between training and validation when using queues: Tensorflow Queues - Switching between train and validation data. Your code should look like this:

# A bool placeholder that tells the graph whether this is a training or a validation step
_is_train = tf.placeholder(dtype=tf.bool, name='is_train')

# Select the training or validation batch based on the _is_train tensor
images = tf.cond(_is_train, lambda: train_imageBatch, lambda: validation_imageBatch)
labels = tf.cond(_is_train, lambda: train_labelBatch, lambda: validation_labelBatch)

train_op = ...
...
for step in range(num_steps):

    # each step
    summary, _ = sess.run([merged, train_step], feed_dict={_is_train: True})
    ...
    if validate_step:
        summary, acc = sess.run([merged, accuracy], feed_dict={_is_train: False})
        ...
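If you want one accuracy figure for the whole validation set (2100 files in the question) rather than for a single batch, you can evaluate it batch by batch and combine the results. The sketch below shows only the bookkeeping in plain Python. `batch_sizes` and `mean_accuracy` are hypothetical helpers, and the per-batch accuracies are made-up numbers standing in for the values `sess.run` would return.

```python
def batch_sizes(num_examples, batch_size):
    """Yield the size of each batch needed to cover num_examples exactly once."""
    for start in range(0, num_examples, batch_size):
        yield min(batch_size, num_examples - start)


def mean_accuracy(per_batch):
    """Combine (batch_accuracy, batch_size) pairs into one weighted accuracy."""
    total = sum(n for _, n in per_batch)
    return sum(acc * n for acc, n in per_batch) / total


# 2100 validation examples in batches of 40: 52 full batches plus one of 20.
sizes = list(batch_sizes(2100, 40))
print(len(sizes), sizes[-1])  # 53 20

# Made-up accuracies: a smaller last batch must count for less in the average.
example = mean_accuracy([(1.0, 40), (0.5, 20)])
print(round(example, 4))  # 0.8333
```

Weighting by batch size matters only when the last batch is smaller than the others. With a queue that always delivers full batches, a plain average over the steps is enough.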

For an implementation of get_batch, see this example from TensorFlow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py.
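It is also worth checking how large the tensor in the error message is. The rough calculation below assumes that shape[16384,20000] is the first-layer weight matrix (128*128 = 16384 inputs; reading 20000 as the number of hidden units is my assumption) stored as float32, as the DT_FLOAT in the error suggests. The helper `tensor_bytes` is made up for this illustration.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element for float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from the error message: OOM when allocating tensor with shape[16384,20000]
weights_gib = tensor_bytes([16384, 20000]) / 2**30

# Adam keeps two extra slot tensors per variable (the .../Adam and .../Adam_1
# nodes in the graph), so the variable needs at least three copies of this size,
# not counting the temporary gradient.
persistent_copies = 3
total_gib = persistent_copies * weights_gib

print("%.2f GiB per copy, %.2f GiB for variable + Adam slots" % (weights_gib, total_gib))
# 1.22 GiB per copy, 3.66 GiB for variable + Adam slots
```

This memory is needed regardless of the batch size, so if the GPU is small, reducing the width of that layer or the input resolution may help more than changing how the data is fed.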

Vijay Mariappan
  • Thanks for replying. I still need some clarification. Would you please explain how to use _is_train in the example, I mean when calling sess.run? Second, how does the queue work in this case? In my example I run the validation and train batches each time to make sure I get new data; I don't understand how that will work in your example. Thanks a lot for your help – Engine Jun 21 '17 at 08:35
  • I've already tried it, but the feeding didn't work. The resource problem is solved, though. Thanks – Engine Jun 21 '17 at 09:33