Modify TensorFlow Code to place preprocessing on CPU and training on GPU

Question

I am reading this performance guide on the best practices for optimizing TensorFlow code for GPU. One suggestion they have is to place the preprocessing operations on the CPU so that the GPU is dedicated to training. I am trying to understand how one would actually implement this within an experiment (i.e. learn_runner.run()). To further the discussion, I'd like to consider the best way to apply this strategy to the Custom Estimator Census Sample provided here.

The article suggests placing `with tf.device('/cpu:0')` around the preprocessing operations. However, when I look at the custom estimator, the 'preprocessing' appears to be done in multiple steps:

1. Line 152/153: `inputs = tf.feature_column.input_layer(features, transformed_columns)` and `label_values = tf.constant(LABELS)`. If I wrapped `with tf.device('/cpu:0')` around these two lines, would that be sufficient to cover the 'preprocessing' in this example?
2. Line 282/294: There is also a generate_input_fn and a parse_csv function that are used to set up input data queues. Would it be necessary to place `with tf.device('/cpu:0')` within these functions as well, or would that basically be forced by having the inputs and label_values already wrapped?

Main Question: Which of the above implementation suggestions is sufficient to properly place all preprocessing on the CPU?

Some additional questions that aren't addressed in the post:

What if the machine has multiple cores? Would 'cpu:0' be limiting?
The post implies to me that by wrapping the preprocessing on the cpu, the GPU would be automatically used for the rest. Is that actually the case?

Distributed ML Engine Experiment

As a follow-up, I would like to understand how this can be further adapted in a distributed ML Engine experiment. Would any of the recommendations above need to change if there were, say, 2 worker GPUs, 1 master CPU and a parameter server? My understanding is that the distributed training would be data-parallel asynchronous training, so each worker iterates through the data independently (and passes gradients asynchronously back to the PS). This suggests to me that no further modifications from the single-GPU setup above would be needed if you train this way. However, this seems a bit too easy to be true.

Tianjin Gu · Accepted Answer · 2017-09-02T06:10:08.083

2

MAIN QUESTION:

The two pieces of code you pointed to are actually two different parts of training. Line 282/294 is, in my opinion, the so-called "pre-processing" part, because it parses raw input data into Tensors. These operations are not suitable for GPU acceleration, so it is sufficient to allocate them on the CPU.

Line 152/153 is part of the training model, because it transforms the raw features into different types of features.

1. 'cpu:0' means the operations in this section will be allocated on the CPU, but they are not bound to a specific core. Operations allocated on the CPU run multi-threaded and use multiple cores.
2. If the machine you are running on has GPUs, TensorFlow will prefer allocating operations on the GPU when no device is specified.
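
A minimal sketch of the two points above, assuming a TensorFlow version that provides `tf.io.decode_csv` and graph construction through `tf.Graph()`. The toy CSV rows, the two float columns, and the names `rows`, `features`, `weights` and `logits` are made up for illustration and are not taken from the census sample:

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Input parsing is explicitly pinned to the CPU.
    with tf.device('/cpu:0'):
        rows = tf.constant(['39,13', '50,9'])  # hypothetical CSV lines
        columns = tf.io.decode_csv(rows, record_defaults=[[0.0], [0.0]])
        features = tf.stack(columns, axis=1)   # shape [2, 2]

    # No device is specified here, so placement is left to TensorFlow,
    # which can pick a GPU when one is available and a GPU kernel exists.
    weights = tf.ones([2, 1])
    logits = tf.matmul(features, weights)

# Inspect the device requested for each op at graph-construction time.
print('parsing op device:', features.op.device)
print('model op device:', repr(logits.op.device))
```

The op built inside the `with tf.device('/cpu:0')` block should report a CPU device, while the op built outside it carries no device request, so its final placement is decided when the graph runs.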

edited Sep 02 '17 at 06:10

answered Sep 01 '17 at 14:32

Tianjin Gu

To clarify #2: it will be placed on the GPU if a GPU kernel exists. Also note that if you are using a machine with more than 1 GPU, you will need to use explicit device statements or they will all get placed on /gpu:0 – rhaertel80 Sep 01 '17 at 14:59
Thanks! These comments help to address some of my clarification questions, but I would love to get confirmation on the main question, which is: where does `with tf.device('cpu:0')` need to be wrapped? Is it sufficient to just wrap this around the input layer in the model_fn (suggested in 1, Line 152/153) or does it also need to be placed elsewhere (such as part of the input_fn in line 282/294)? – reese0106 Sep 01 '17 at 15:03
I have updated my question to make this "main question" more clear – reese0106 Sep 01 '17 at 15:11
To be "safe", you should pin all of the ops you mentioned to cpu. – rhaertel80 Sep 01 '17 at 17:05
For line 282/294 which portions of those two functions need to be wrapped in the `with tf.device('cpu:0')`? @rhaertel80 suggests to pin all of the ops, but these lines are pointing to functions. Would it be sufficient to place `with tf.device('cpu:0')` as the first line of the function and then indent the rest of the function? It's evident to me how to pin any op that is contained within the model_fn, but it's not as clear to me that the input_fn's tf.device pin will carry over as expected through the experiment – reese0106 Sep 01 '17 at 19:39
I would like to update my question with the specific implementation for the future, but would like to confirm the implementation you are suggesting. – reese0106 Sep 01 '17 at 19:40
Following up here: when I placed the CPU preprocessing around the input_fn() and the parse_csv() function I received a minor uptick. However, if I wrap the "input_layer" in the with tf.device statement, my steps/sec dropped by more than 1/2. So it would appear that placing the input features portion of the graph onto the CPU is actually not beneficial. – reese0106 Sep 01 '17 at 20:42
The input_fn is already pinned to the CPU: https://github.com/tensorflow/tensorflow/blob/a2e1a5e0985d8d2c1ec3a3f246dad214600af21c/tensorflow/python/estimator/estimator.py#L587 – rhaertel80 Sep 01 '17 at 21:17

score 1 · Answer 2 · answered Sep 01 '17 at 15:03

The previous answer accurately describes device placement. Allow me to provide an answer to the questions about distributed TF.

The first thing to note is that, whenever possible, prefer a single machine with lots of GPUs to multiple machines with single GPUs. The bandwidth to parameters in RAM on the same machine (or even better, on the GPUs themselves) is orders of magnitude faster than going over the network.

That said, there are times where you'll want distributed training, including remote parameter servers. In that case, you would not necessarily need to change anything in your code from the single machine setup.
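
One reason the code can stay the same is that, as far as I know, ML Engine describes the cluster to each replica through the `TF_CONFIG` environment variable rather than through the training code. The sketch below only uses the standard library to show what such a description looks like for 1 master, 2 workers and 1 parameter server. The helper `describe_task`, the host names and the port are hypothetical and exist only for this illustration:

```python
import json
import os


def describe_task(tf_config_json):
    """Summarize the cluster layout in a TF_CONFIG-style JSON string."""
    config = json.loads(tf_config_json or '{}')
    cluster = config.get('cluster', {})
    task = config.get('task', {})
    return {
        'task_type': task.get('type'),
        'task_index': task.get('index'),
        'num_workers': len(cluster.get('worker', [])),
        'num_ps': len(cluster.get('ps', [])),
    }


# Hypothetical layout as seen by the second worker (made-up addresses).
example_config = json.dumps({
    'cluster': {
        'master': ['master-0:2222'],
        'worker': ['worker-0:2222', 'worker-1:2222'],
        'ps': ['ps-0:2222'],
    },
    'task': {'type': 'worker', 'index': 1},
})

# Use the real variable when it is set, otherwise the example above.
summary = describe_task(os.environ.get('TF_CONFIG', example_config))
print(summary)
```

Each replica receives the same `cluster` section and a different `task` section, which is why the same training script can run unchanged as master, worker or parameter server.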

rhaertel80

this makes sense to me. 1) which (if any or both) of the two proposed `with tf.device()` statements is needed to shift all preprocessing to CPU? 2) Your suggestion of a single machine with lots of GPUs does not really seem feasible with ML engine tiers as set up. The only GPU option is a BASIC_GPU and then any other GPU options require a custom configuration which can have a single `complex_model_l_gpu` with 8 GPUs, but it will still require using a distributed setup with masters, workers and parameter servers. Is there a way to use a single machine with lots of GPUs in ML engine? – reese0106 Sep 01 '17 at 15:10
(1) I'll comment on the post below (2) You only need 1 master and no other machine types in a custom setup; use the `complex_model_l_gpu`. – rhaertel80 Sep 01 '17 at 16:22
