Naive model partitioning across several GPUs results in the workload moving from GPU to GPU during the forward and backward pass, so only one GPU is busy at any instant. Here's the naive version:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# input_shape and num_classes are assumed to be defined elsewhere.
model = Sequential()

with tf.device('/gpu:0'):
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

with tf.device('/gpu:1'):
    model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

with tf.device('/gpu:2'):
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(1500, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
We need sample code (a template) that pipelines the work and keeps all GPUs busy by feeding waves of batches, and that coordinates the work on each GPU (forward pass, gradient computation, parameter updates).
A hint is provided here via the use of data_flow_ops.StagingArea, but a concrete example would be helpful.
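For what it's worth, here is a minimal TF 1.x graph-mode sketch of the StagingArea pattern as I understand it, not the full pipelined template being asked for: a StagingArea on the first GPU pre-loads the next batch while the current one is being processed, and colocate_gradients_with_ops keeps each layer's backward pass on the device that ran its forward pass. The shapes, the layer sizes, and the `batches` iterator are made-up placeholders.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

images = tf.placeholder(tf.float32, [None, 28, 28, 1])
labels = tf.placeholder(tf.int32, [None])

with tf.device('/gpu:0'):
    # The StagingArea lives on GPU 0: put() copies the next batch onto the
    # device while the current training step is still running.
    area = data_flow_ops.StagingArea(dtypes=[tf.float32, tf.int32])
    stage = area.put([images, labels])
    staged_images, staged_labels = area.get()
    # First part of the model on GPU 0.
    h = tf.layers.conv2d(staged_images, 32, 3, activation=tf.nn.relu)

with tf.device('/gpu:1'):
    # Second part of the model, plus the loss, on GPU 1.
    h = tf.layers.flatten(h)
    logits = tf.layers.dense(h, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(staged_labels, logits)

opt = tf.train.GradientDescentOptimizer(0.01)
# Keep each gradient op on the same device as its forward op.
train_op = opt.minimize(loss, colocate_gradients_with_ops=True)

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    it = iter(batches)                 # `batches` yields (imgs, lbls) pairs -- assumed
    first_imgs, first_lbls = next(it)
    # Warm-up: stage one batch before entering the steady-state loop.
    sess.run(stage, feed_dict={images: first_imgs, labels: first_lbls})
    for imgs, lbls in it:
        # Each step trains on the previously staged batch and stages the next
        # one, so the host->GPU copy overlaps with compute.
        sess.run([train_op, stage], feed_dict={images: imgs, labels: lbls})

Note this only overlaps the input copy with compute; it does not remove the bubble where GPU 1 waits for GPU 0 (and vice versa during backprop). True pipelining would additionally split each batch into micro-batches and interleave their forward/backward passes, which is the part I'm missing a template for.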
I understand that data partitioning (data parallelism) is the usual way to go, but there are use cases where the model needs to be partitioned across CPU+GPUs.
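A typical example of the kind of CPU+GPU split I have in mind (illustrative only, with made-up sizes): an embedding table too large for GPU memory kept on the host, with the rest of the model on the GPU.

import tensorflow as tf

ids = tf.placeholder(tf.int32, [None])

with tf.device('/cpu:0'):
    # The embedding table stays in host RAM because it would not fit on the GPU.
    embeddings = tf.get_variable('embeddings', shape=[50000000, 128])
    embedded = tf.nn.embedding_lookup(embeddings, ids)

with tf.device('/gpu:0'):
    # The dense part of the model runs on the GPU.
    hidden = tf.layers.dense(embedded, 256, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, 10)

The open question is the same as above: how to keep the GPU busy while the CPU side is doing lookups (and applying the sparse gradient updates) for the next batch.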
Grateful for any pointer or sample (pseudo)code.