I am trying to train a convolutional model for computer vision on Google ML Engine with a BASIC_GPU tier instance, but training stalls for up to an hour at seemingly random intervals, as can be seen in this TensorBoard screenshot of the cost function: (image: Cost function plotted over time)
The only obvious cause I can think of is the multithreading or multiprocessing I use (both produce the same problem, so I'll refer to them collectively as parallel processing from now on). I use parallel processing to fetch and preprocess my images in parallel: Google Storage bucket latencies are around 100 ms, and with some OpenCV preprocessing on top, a batch can take 5 to 6 seconds when done sequentially. The parallel fetch works by spawning workers from a function:
import threading as thr
import time
import Queue  # Python 2; use `queue` on Python 3
import numpy as np

def read_image_always_out(self, dirlist, que, seed):
    """Worker: keep sampling until a valid image pair is found, then queue it."""
    found_img = False
    np.random.seed(seed)
    number_of_files = len(dirlist)
    while not found_img:
        i = np.random.randint(number_of_files)
        seed = np.random.randint(self.maxInt)
        # each dirlist entry holds the image path and its annotation path
        datapath, annopath = dirlist[i]
        ex, se, found_img, _ = self.get_image_set(datapath, annopath, seed)
    que.put([ex, se])
    que.task_done()  # mark the item as done right away so que.join() below never blocks

def read_image_batch_thread(self, dirlist, seed):
    wait_for_thread = 5.0  # seconds to wait for each result / thread join
    start_time = time.time()
    np.random.seed(seed)
    search_images, exemplars = [], []
    que = Queue.Queue(self.batch_size)
    process_list = []
    # spawn one worker thread per image in the batch
    for i in range(self.batch_size):
        seed = np.random.randint(self.maxInt)
        p = thr.Thread(target=self.read_image_always_out,
                       args=(dirlist, que, seed))
        p.start()
        process_list.append(p)
    no_data = []
    # collect the results, giving each worker a bounded amount of time
    for i in range(self.batch_size):
        try:
            images = que.get(True, wait_for_thread)
        except Queue.Empty:
            print("timeout waiting for image")
            no_data.append(i)
        else:
            exemplars.append(images[0])
            search_images.append(images[1])
    for p in process_list:
        p.join(wait_for_thread)
    que.join()
    duration_image_get = time.time() - start_time
    return exemplars, search_images, duration_image_get, no_data
Whenever training is not stalling, the parallel fetch works like a charm and reduces image load time to around a second, greatly improving the training speed of my model.
The kicker is that none of these problems show up when running the training locally. It seems like this is a bug specific to ML Engine, or am I missing something? My search for documented ML Engine restrictions or for solutions to this problem has come up dry.
Does anyone have experience with this issue and know why it does not work, or what else I could try? Is this a bug, or a restriction of ML Engine?
I know there are some workarounds, like packing chunks of the training data into bigger files so I only have to download one file per batch instead of many, or using tf.train.QueueRunner, although my specific image preprocessing can't be expressed with the TensorFlow API, so all images would have to be preprocessed beforehand. Both of these solutions require preprocessing the images up front, which I want to avoid at all costs: I have not yet established the best image sizes and don't want to build a separate image set for every experiment I want to try out.
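For reference, this is roughly what the tf.train.QueueRunner route I mean would look like under TF 1.x. It is only a sketch: the filenames list, batch size, and fixed 255x255 image shape are placeholder assumptions, and it only works once every image has already been preprocessed to that fixed shape.

import tensorflow as tf

# Placeholder list of already-preprocessed images in the bucket (assumption).
filenames = ["gs://my-bucket/preprocessed/img_0.png",
             "gs://my-bucket/preprocessed/img_1.png"]

# A queue of filenames, fed by TensorFlow's own QueueRunner threads.
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
reader = tf.WholeFileReader()
_, raw = reader.read(filename_queue)

# Decode and cast; set_shape only holds if the images were already preprocessed to 255x255x3.
image = tf.image.decode_png(raw, channels=3)
image = tf.cast(image, tf.float32)
image.set_shape([255, 255, 3])

# tf.train.batch registers QueueRunners that prefetch batches on background threads.
image_batch = tf.train.batch([image], batch_size=8, num_threads=4, capacity=32)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch = sess.run(image_batch)  # shape: (8, 255, 255, 3)
    coord.request_stop()
    coord.join(threads)

The prefetching threads are managed by TensorFlow itself here, which is what I would like, but it presupposes a fixed, already-preprocessed image set, which is exactly what I'm trying to avoid.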