
I have a GCP Compute Engine VM, an n1-standard-16, with 4 P100 GPUs attached and a solid-state drive for storing data. I'll refer to this as "the VM".

I've previously used the VM to train a TensorFlow-based CNN. I want to move away from this to AI Platform so I can run multiple jobs simultaneously. However, I've run into some problems.

Problems

When the training is run on the VM I can set a batch size of 400, and the standard time for an epoch to complete is around 25 minutes.

When the training runs on a complex_model_m_p100 AI Platform machine, which I believe to be equivalent to the VM, the maximum batch size I can set is 128, and the standard time for an epoch to complete is 1 hour 40 minutes.

Differences: the VM vs AI Platform

  • The VM uses TF 1.12 and AI Platform uses TF 1.15. Consequently, there is a difference in CUDA versions (CUDA 9 vs CUDA 10).

  • The VM is equipped with a solid-state drive, which I don't think is the case for AI Platform machines.

I want to understand the cause of the reduced batch size, and decrease the epoch times on AI Platform to levels comparable to the VM. Has anyone else run into this issue? Am I running on the correct kind of AI Platform machine? Any advice would be welcome!

James
  • AI Platform gives a regular 100 GB persistent disk by default, which might impact the IO throughput if you access the dataset on local disk. Can you try it on a GCE VM with a regular persistent disk to see if that's the bottleneck, please? – Guoqing Xu Feb 26 '20 at 19:37
  • This is not a programming issue but one about server configuration which should, instead, be asked on https://serverfault.com/ - [What topics can I ask about here?](https://stackoverflow.com/help/on-topic) – Rob Mar 05 '20 at 13:25
  • @GuoqingXu where is that documented? I would love to know. I have 1 TB of training data and I don't think AI Platform Training will work for me; just getting 18 GB of files from GCS at the start of my container fails at 46%, it just gives up. – Marc May 16 '20 at 04:48
  • Can you please send an email describing the issues to cloudml-feedback@google.com? We will make sure these issues will get resolved. – Guoqing Xu May 17 '20 at 05:12

2 Answers


It could be a bunch of things. There are two ways to go about it. Either make the VM look more like AI Platform:

export IMAGE_FAMILY="tf-latest-gpu" # 1.15 instead of 1.12
export ZONE=...
export INSTANCE_NAME=...

gcloud compute instances create $INSTANCE_NAME \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

... and then attach 4 GPUs after that.
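For reference, a minimal sketch of requesting the 4 P100s directly at creation time with the --accelerator flag (the machine type and accelerator count here are assumptions matching the VM described in the question, and the zone must offer P100s):

gcloud compute instances create $INSTANCE_NAME \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --machine-type=n1-standard-16 \
  --accelerator="type=nvidia-tesla-p100,count=4" \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"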

... or make AI Platform look more like the VM: https://cloud.google.com/ai-platform/training/docs/machine-types#gpus-and-tpus, because you are using a legacy machine type right now.

Frederik Bode
  • Thanks for your suggestion! I ran a comparison job on a machine configured like this: `masterType: n1-highcpu-16 masterConfig: acceleratorConfig: count: 4 type: NVIDIA_TESLA_P100`, but I still see a ~4x slowdown! – James Feb 26 '20 at 12:27
  • A 4x slowdown would suggest you're only using one of your 4 GPUs. You can try comparing VM-2GPU and AIPlatform-2GPU and see if you get a 2x slowdown. – Frederik Bode Feb 26 '20 at 12:30
  • To fix that you might need to add some kind of distributed training strategy; AI Platform might have set a different default one from standard TensorFlow for some weird reason. – Frederik Bode Feb 26 '20 at 12:34
  • OK, this is weird. With 4 GPUs: 160 seconds per epoch, 2 GPUs: 160 seconds per epoch, 1 GPU: 80 seconds per epoch! Any ideas how this could happen? – James Feb 26 '20 at 14:28
  • Hmm damn, that's still a 2x slowdown, meaning it's still something else. Have you tried a VM with 1 GPU; is that also 80 sec/epoch? My guess is that AI Platform is using a Distribution Strategy that worsens your performance by 2x instead of improving it by 2x, which is why you see 4x worse performance. – Frederik Bode Feb 26 '20 at 15:11

After following the advice of @Frederik Bode and creating a replica VM with TF 1.15 and the associated drivers installed, I've managed to solve my problem.

Rather than using the multi_gpu_model function call within tf.keras, it's actually best to use a distributed strategy and run the model within that scope.

There is a guide describing how to do it here.

Essentially now my code looks like this:

mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
    # Build the datasets and model inside the strategy scope so the
    # variables are mirrored across all available GPUs.
    training_dataset, validation_dataset = get_datasets()
    model = setup_model()

    # Don't do this, it's not necessary!
    #### NOT NEEDED model = tf.keras.utils.multi_gpu_model(model, 4)

    opt = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])

    steps_per_epoch = args.steps_per_epoch
    validation_steps = args.validation_steps

    model.fit(training_dataset, steps_per_epoch=steps_per_epoch, epochs=args.num_epochs,
              validation_data=validation_dataset, validation_steps=validation_steps)

I set up a small dataset so I could rapidly prototype this.
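The get_datasets() and setup_model() helpers aren't shown in the answer; purely as a hypothetical illustration of the kind of small prototyping setup meant here, a sketch assuming a toy dataset and a stand-in CNN rather than the author's real pipeline:

import tensorflow as tf

def get_datasets(batch_size=128):
    # Hypothetical prototype pipeline: a small in-memory dataset wrapped in
    # tf.data, standing in for the real (much larger) training data.
    (x_train, y_train), (x_val, y_val) = tf.keras.datasets.fashion_mnist.load_data()
    x_train = (x_train[..., None] / 255.0).astype("float32")
    x_val = (x_val[..., None] / 255.0).astype("float32")

    training = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                .shuffle(10000).batch(batch_size).repeat())
    validation = (tf.data.Dataset.from_tensor_slices((x_val, y_val))
                  .batch(batch_size).repeat())
    return training, validation

def setup_model():
    # Hypothetical stand-in CNN; the real model is the author's own architecture.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])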

With a single P100 GPU the epoch time averaged 66 seconds.

With 4 GPUs, using the code above, the average epoch time was 19 seconds.
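One thing worth noting when comparing batch sizes (as in the original question): with MirroredStrategy the batch size set on the dataset is the global batch size, which gets split across the replicas, so it can be scaled with the number of GPUs. A minimal sketch, with an illustrative per-GPU batch size that is not from the original post:

# Assuming the dataset has not been batched yet: scale the global batch size
# with the number of replicas (GPUs) so each GPU keeps the same per-device batch.
per_replica_batch_size = 100  # illustrative value
global_batch_size = per_replica_batch_size * mirrored_strategy.num_replicas_in_sync
training_dataset = training_dataset.batch(global_batch_size)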

James