
I've created several jobs to train a CNN using Google Cloud ML Engine. Each time the job finished successfully, but with a GPU error in the logs. The printed device placement included some GPU activity, yet there was no GPU usage shown in the job details/utilization.

Here is the command I use to create a job:

gcloud beta ml-engine jobs submit training fei_test34 --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.main --region europe-west1 --staging-bucket gs://tfoutput --scale-tier BASIC_GPU -- --data=gs://crispdata/cars_128 --max_epochs=1 --train_log_dir=gs://tfoutput/joboutput --model=trainer.crisp_model_2x64_2xBN --validation=True -x

Here is the device placement log: log device placement. GPU error: GPU error detail

More info:

When I ran my code on Google Cloud ML Engine, the average training speed using one Tesla K80 was 8.2 examples/sec, and the average speed without a GPU was 5.7 examples/sec, with image size 112x112. With the same code I got 130.4 examples/sec using one GRID K520 on Amazon AWS. I expected the Tesla K80 to be faster. I also got the GPU error I posted yesterday. Additionally, in the Compute Engine Quotas page I can see CPU usage above 0%, but GPU usage remains at 0%. I was wondering whether the GPU is really being used.

I am not familiar with cloud computing, so I'm not sure I've provided enough information. Feel free to ask for more details.

I just tried setting the scale tier to complex_model_m_gpu; the training speed is about the same as with one GPU (since my code is written for a single GPU), but there is more information in the log. Here is a copy of the log:

I successfully opened CUDA library libcudnn.so.5 locally

I successfully opened CUDA library libcufft.so.8.0 locally

I successfully opened CUDA library libcuda.so.1 locally

I successfully opened CUDA library libcurand.so.8.0 locally

I Summary name cross_entropy (raw) is illegal; using cross_entropy__raw_ instead.

I Summary name total_loss (raw) is illegal; using total_loss__raw_ instead.

W The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.

W The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 0 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:04.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

W creating context when one is currently active; existing: 0x39ec240

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 1 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:05.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

W creating context when one is currently active; existing: 0x39f00b0

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 2 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:06.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

W creating context when one is currently active; existing: 0x3a148b0

I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I Found device 3 with properties:

E name: Tesla K80

E major: 3 minor: 7 memoryClockRate (GHz) 0.8235

E pciBusID 0000:00:07.0

E Total memory: 11.20GiB

E Free memory: 11.13GiB

I Peer access not supported between device ordinals 0 and 1

I Peer access not supported between device ordinals 0 and 2

I Peer access not supported between device ordinals 0 and 3

I Peer access not supported between device ordinals 1 and 0

I Peer access not supported between device ordinals 1 and 2

I Peer access not supported between device ordinals 1 and 3

I Peer access not supported between device ordinals 2 and 0

I Peer access not supported between device ordinals 2 and 1

I Peer access not supported between device ordinals 2 and 3

I Peer access not supported between device ordinals 3 and 0

I Peer access not supported between device ordinals 3 and 1

I Peer access not supported between device ordinals 3 and 2

I DMA: 0 1 2 3

I 0: Y N N N

I 1: N Y N N

I 2: N N Y N

I 3: N N N Y

I Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)

I Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)

I Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)

I Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)

I Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)

I Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)

I Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)

I Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)

I 361

I bucket = crispdata, folder = cars_128/train

I path = gs://crispdata/cars_128/train

I Num examples = 240

I bucket = crispdata, folder = cars_128/val

I path = gs://crispdata/cars_128/val

I Num examples = 60

I {'flop': False, 'learning_rate_decay_factor': 0.005, 'train_log_dir': 'gs://tfoutput/joboutput/20170411_144221', 'valid_score_path': '/home/ubuntu/tensorflow/cifar10/validation_score.csv', 'saturate_epoch': 200, 'test_score_path': '', 'max_tries': 75, 'max_epochs': 10, 'id': '20170411_144221', 'test_data_size': 0, 'memory_usage': 0.3, 'load_size': 128, 'test_batch_size': 10, 'max_out_norm': 1.0, 'email_notify': False, 'skip_training': False, 'log_device_placement': False, 'learning_rate_decay_schedule': '', 'cpu_only': False, 'standardize': False, 'num_epochs_per_decay': 1, 'zoom_out': 0.0, 'val_data_size': 100, 'learning_rate': 0.1, 'grayscale': 0.0, 'train_data_size': 250, 'minimal_learning_rate': 1e-05, 'save_valid_scores': False, 'train_batch_size': 50, 'rotation': 0.0, 'val_epoch_size': 2, 'data': 'gs://crispdata/cars_128', 'val_batch_size': 50, 'num_classes': 2, 'learning_rate_decay': 'linear', 'random_seed': 5, 'num_threads': 1, 'num_gpus': 1, 'test_dir': '', 'shuffle_traindata': False, 'pca_jitter': 0.0, 'moving_average_decay': 1.0, 'sample_size': 128, 'job-dir': 'gs://tfoutput/joboutput', 'learning_algorithm': 'sgd', 'train_epoch_size': 5, 'model': 'trainer.crisp_model_2x64_2xBN', 'validation': False, 'tower_name': 'tower'}

I Filling queue with 100 CIFAR images before starting to train. This will take a few minutes.

I name: "train"

I op: "NoOp"

I input: "^GradientDescent"

I input: "^ExponentialMovingAverage"

I 128 128

I 2017-04-11 14:42:44.766116: epoch 0, loss = 0.71, lr = 0.100000 (5.3 examples/sec; 9.429 sec/batch)

I 2017-04-11 14:43:19.077377: epoch 1, loss = 0.53, lr = 0.099500 (8.1 examples/sec; 6.162 sec/batch)

I 2017-04-11 14:43:51.994015: epoch 2, loss = 0.40, lr = 0.099000 (7.7 examples/sec; 6.479 sec/batch)

I 2017-04-11 14:44:22.731741: epoch 3, loss = 0.39, lr = 0.098500 (8.2 examples/sec; 6.063 sec/batch)

I 2017-04-11 14:44:52.476539: epoch 4, loss = 0.24, lr = 0.098000 (8.4 examples/sec; 5.935 sec/batch)

I 2017-04-11 14:45:23.626918: epoch 5, loss = 0.29, lr = 0.097500 (8.1 examples/sec; 6.190 sec/batch)

I 2017-04-11 14:45:54.489606: epoch 6, loss = 0.56, lr = 0.097000 (8.6 examples/sec; 5.802 sec/batch)

I 2017-04-11 14:46:27.022781: epoch 7, loss = 0.12, lr = 0.096500 (6.4 examples/sec; 7.838 sec/batch)

I 2017-04-11 14:46:57.335240: epoch 8, loss = 0.25, lr = 0.096000 (8.7 examples/sec; 5.730 sec/batch)

I 2017-04-11 14:47:30.425189: epoch 9, loss = 0.11, lr = 0.095500 (7.8 examples/sec; 6.398 sec/batch)

Does this mean that the GPUs are in use? If so, any idea why there is such a huge speed difference compared with the GRID K520 running the same code?

Fei
  • Do you mind expanding the "jsonPayload" in the GPU error detail (there's a chance it doesn't have any more information, and if not, please confirm that here). – rhaertel80 Apr 10 '17 at 15:28
  • @rhaertel80 Here is the expanded jsonPayload info: jsonPayload: { lineno: 348, message: "name: Tesla K80", levelname: "ERROR", pathname: "/runcloudml.py", created: 1491314110.59608 } – Fei Apr 10 '17 at 15:51
  • The only problem I'm seeing so far is that a log message from the GPU or TensorFlow layer is logged at the wrong severity. The job completed successfully and the log device placement shows ops are being assigned to GPUs. So, other than the fact that the string "name: Tesla K80" is incorrectly logged at level ERROR, does anything appear to be wrong? We don't provide any metrics about GPU utilization, so I'm not sure what GPU usage data you are expecting but not finding. – Jeremy Lewi Apr 10 '17 at 16:40
  • @JeremyLewi I've added some more information about the problem. Thanks. – Fei Apr 11 '17 at 10:34

1 Answer


So the log messages indicate that GPUs are available. To check whether GPUs are actually being used, you can turn on logging of device placement to see which ops are assigned to GPUs.
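
For example, a minimal sketch using the TF 1.x API (the constants here are illustrative, not taken from the poster's code):

import tensorflow as tf

# log_device_placement=True makes TensorFlow print the device (CPU or GPU)
# each op is assigned to; on Cloud ML Engine this output appears in the job logs.
a = tf.constant([[1.0, 2.0]], name='a')
b = tf.constant([[3.0], [4.0]], name='b')
c = tf.matmul(a, b, name='c')

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))

If ops such as MatMul show up on /gpu:0 in that output, the GPU is actually being used for them.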

The Cloud Compute console won't show any utilization metrics related to Cloud ML Engine. If you look at the Cloud Console UI for your jobs, you will see memory and CPU graphs but not GPU graphs.

Jeremy Lewi
  • GPUs are working fine. The training speed difference was caused by loading raw images from the cloud bucket. I created a queue with all the image filenames and loaded images during training. I guess that on AWS all the images are stored on the same server as the GPUs, while on Google Cloud the images are stored on different servers, which makes loading data slower? After I saved all images as two binary files, the training speed increased from 8.5 img/s to 170 img/s. Are there any other suggestions for loading images, or is there a way to store all images on the same server when using a cloud bucket? – Fei Apr 21 '17 at 13:00
  • What happens if you add another queue after reading the images? I think this will cause TensorFlow to buffer the reading of the images from GCS which is the expensive operation. So I think you want to use a queue so that the loading happens in a thread asynchronously with respect to running training steps. – Jeremy Lewi Apr 21 '17 at 17:45
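
A rough sketch of that queue-after-reading pattern with the TF 1.x input pipeline API (the GCS file pattern, image size, and batch size below are illustrative assumptions, not taken from the original code):

import tensorflow as tf

# Illustrative GCS pattern; substitute the real training folder.
filenames = tf.train.match_filenames_once('gs://crispdata/cars_128/train/*.jpg')

# First queue: filenames to read.
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
reader = tf.WholeFileReader()
_, encoded = reader.read(filename_queue)

image = tf.image.decode_jpeg(encoded, channels=3)
image = tf.image.resize_images(image, [128, 128])  # assumed load size

# Second queue: tf.train.batch prefetches decoded images on background
# threads, so the slow reads from GCS overlap with the training steps.
images = tf.train.batch([image], batch_size=50, num_threads=4, capacity=200)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch = sess.run(images)  # a training step would consume this batch
    coord.request_stop()
    coord.join(threads)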