I have a GCP cloud compute VM, which is an n1-standard-16
, with 4 P100 GPUs attached, and a solid state drive for storing data. I'll refer to this as "the VM".
I've previously used the VM to train a tensorflow based CNN. I want to move away from this to using AI Platform so I can run multiple jobs simultaneously. However I've run into some problems.
Problems
When the training is run on the VM I can set a batch size of 400, and the standard time for an epoch to complete is around 25 minutes.
When the training is running on a complex_model_m_p100
AI platform machine, which I believe to be equivalent to the VM, I can set a maximum batch size of 128, and the standard time for an epoch to complete is 1 hour 40 minutes.
Differences: the VM vs AI Platform
The VM uses TF1.12 and AI Platform uses TF1.15. Consequently there is a difference in GPU drivers (CUDA 9 vs CUDA 10).
The VM is equipped with a solid state drive, which I don't think is the case for AI platform machines.
I want to understand the cause of the reduced batch size, and decrease the epoch times on AI Platform to comparable levels to Glamdring. Has anyone else run into this issue? Am I running on the correct kind of AI Platform machine? Any advice would be welcome!