In data-parallel training, my guess is that a GPU instance isn't cost-effective for parameter servers, because parameter servers only store and update the parameter values and don't run heavy computation such as matrix multiplication.
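To illustrate what I mean, here is a rough TensorFlow 1.x sketch of the usual between-graph replication setup; the host names and variable shape are just placeholders, and I'm assuming the 4 parameter server / 3 worker split from the config below:

import tensorflow as tf

# Hypothetical cluster matching the config below: 4 parameter servers, 3 workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222", "ps2:2222", "ps3:2222"],
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222"],
})

# replica_device_setter assigns variables to the ps tasks (round-robin)
# and everything else to the local worker, so the ps machines mostly hold
# and apply parameter updates rather than doing heavy math.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable("w", shape=[784, 10])    # stored on a ps task
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w)                     # computed on the worker (GPU)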
Therefore, I think the example config for Cloud ML Engine below, which uses CPU machines for the parameter servers and GPU machines for the master and workers, should give good cost performance:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: standard  # CPU-only machine type
  workerCount: 3
  parameterServerCount: 4
Is that right?
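For reference, I'd submit the job with something like this (the job name, bucket, package path, and region are placeholders):

gcloud ml-engine jobs submit training my_training_job \
  --config config.yaml \
  --module-name trainer.task \
  --package-path trainer/ \
  --job-dir gs://my-bucket/output \
  --region us-central1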