I am trying to train a convolutional model for computer vision on Google ML Engine with a BASIC_GPU tier instance, but training stalls for up to an hour at seemingly random intervals, as can be seen in this TensorBoard screenshot of the cost function: (image: Cost function plotted over time)
The only obvious cause I can think of is the multithreading or multiprocessing I use (both produce the same problem, so I'll refer to them collectively as parallel processing from now on). I use parallel processing to fetch and preprocess my images in parallel: Google Storage bucket latencies are around 100 ms, and with some OpenCV preprocessing on top, a batch can take 5 to 6 seconds when done sequentially. The parallel fetch works by spawning workers from a function:
import threading as thr
import time
import Queue  # Python 2; use `queue` on Python 3
import numpy as np

def read_image_always_out(self, dirlist, que, seed):
    """Worker: keep sampling until a valid image pair is found, then queue it."""
    found_img = False
    np.random.seed(seed)
    number_of_files = len(dirlist)
    while not found_img:
        i = np.random.randint(number_of_files)
        seed = np.random.randint(self.maxInt)
        # each dirlist entry holds the image path and its annotation path
        datapath, annopath = dirlist[i]
        ex, se, found_img, _ = self.get_image_set(datapath, annopath, seed)
    que.put([ex, se])
    que.task_done()  # mark the item as done right away so que.join() below never blocks

def read_image_batch_thread(self, dirlist, seed):
    wait_for_thread = 5.0  # seconds to wait for each result / thread join
    start_time = time.time()
    np.random.seed(seed)
    search_images, exemplars = [], []
    que = Queue.Queue(self.batch_size)
    process_list = []
    # spawn one worker thread per image in the batch
    for i in range(self.batch_size):
        seed = np.random.randint(self.maxInt)
        p = thr.Thread(target=self.read_image_always_out,
                       args=(dirlist, que, seed))
        p.start()
        process_list.append(p)
    no_data = []
    # collect the results, giving each worker a bounded amount of time
    for i in range(self.batch_size):
        try:
            images = que.get(True, wait_for_thread)
        except Queue.Empty:
            print("timeout waiting for image")
            no_data.append(i)
        else:
            exemplars.append(images[0])
            search_images.append(images[1])
    for p in process_list:
        p.join(wait_for_thread)
    que.join()
    duration_image_get = time.time() - start_time
    return exemplars, search_images, duration_image_get, no_data
Whenever training is not stalling, the parallel fetch works like a charm and reduces image load time to around a second, greatly improving the training speed of my model.
The kicker is that none of these problems show up when running the training locally. It seems like this is a bug specific to ML Engine, or am I missing something? My search for documented ML Engine restrictions or for solutions to this problem has come up dry.
Does anyone have experience with this issue and know why it does not work, or what else I could try? Is this a bug, or a restriction of ML Engine?
I know there are some workarounds, like packing chunks of the training data into bigger files so I only have to download one file per batch instead of many, or using tf.train.QueueRunner, although my specific image preprocessing can't be expressed with the TensorFlow API, so all images would have to be preprocessed beforehand. Both of these solutions require preprocessing the images up front, which I want to avoid at all costs: I have not yet established the best image sizes and don't want to build a separate image set for every experiment I want to try out.
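For reference, this is roughly what the tf.train.QueueRunner route I mean would look like under TF 1.x. It is only a sketch: the filenames list, batch size, and fixed 255x255 image shape are placeholder assumptions, and it only works once every image has already been preprocessed to that fixed shape.

import tensorflow as tf

# Placeholder list of already-preprocessed images in the bucket (assumption).
filenames = ["gs://my-bucket/preprocessed/img_0.png",
             "gs://my-bucket/preprocessed/img_1.png"]

# A queue of filenames, fed by TensorFlow's own QueueRunner threads.
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
reader = tf.WholeFileReader()
_, raw = reader.read(filename_queue)

# Decode and cast; set_shape only holds if the images were already preprocessed to 255x255x3.
image = tf.image.decode_png(raw, channels=3)
image = tf.cast(image, tf.float32)
image.set_shape([255, 255, 3])

# tf.train.batch registers QueueRunners that prefetch batches on background threads.
image_batch = tf.train.batch([image], batch_size=8, num_threads=4, capacity=32)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch = sess.run(image_batch)  # shape: (8, 255, 255, 3)
    coord.request_stop()
    coord.join(threads)

The prefetching threads are managed by TensorFlow itself here, which is what I would like, but it presupposes a fixed, already-preprocessed image set, which is exactly what I'm trying to avoid.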