
I am trying to train a very deep model on Cloud ML, but I am running into serious memory issues that I cannot work around. The model is a very deep convolutional neural network for auto-tagging music.

The model is shown in the image below. A batch of 20 examples, each a 12x38832x1 tensor, is fed into the network.

Each piece of music was originally 465894x1 samples, which was then split into 12 windows, hence 12x38832x1. Inside the map_fn function, each iteration processes one separate 38832x1 window (with conv1d).

Processing one window at a time yields better results than running a single CNN over the whole piece. The data was split before being stored in TFRecords in order to minimise the processing needed during training. It is loaded into a queue with a maximum size of 200 examples (i.e. 10 batches).
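
For reference, the input pipeline looks roughly like this. This is only a sketch: the feature names, parsing shapes and reader setup are assumptions, not the exact code.

    import tensorflow as tf

    def input_fn(filenames, batch_size=20):
        # Queue of TFRecord files; examples are read and parsed one at a time.
        filename_queue = tf.train.string_input_producer(filenames)
        reader = tf.TFRecordReader()
        _, serialized = reader.read(filename_queue)

        # Feature names/shapes are assumptions; the real records already hold
        # the pre-split windows, so no splitting happens at training time.
        features = tf.parse_single_example(serialized, features={
            'audio': tf.FixedLenFeature([12 * 38832], tf.float32),
            'tags': tf.FixedLenFeature([50], tf.float32),
        })
        audio = tf.reshape(features['audio'], [12, 38832, 1])

        # Batching queue with a capacity of 200 examples (~10 batches of 20).
        audio_batch, tag_batch = tf.train.batch(
            [audio, features['tags']], batch_size=batch_size, capacity=200)
        return audio_batch, tag_batch   # shapes [20, 12, 38832, 1] and [20, 50]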

Once dequeued, the batch is transposed so that the window dimension of 12 comes first, which lets map_fn iterate over the windows. It is not transposed before being queued because the first dimension has to match the batch dimension of the output, which is [20, 50], where 20 is the batch size and 50 is the number of tags.

Each window is processed separately and the per-window results from map_fn are then "superpooled" by a smaller network. The window processing is done by a very deep neural network, and this is where the trouble starts: every cluster configuration I have tried gives me out-of-memory errors.
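
The per-window processing described above would look roughly like the following. It is a sketch only: window_cnn stands in for the very deep conv1d stack, and the "superpool" network is reduced to a mean plus one dense layer.

    import tensorflow as tf

    def window_cnn(window):
        # Stand-in for the very deep stack of conv1d layers.
        # window: [20, 38832, 1]
        net = tf.layers.conv1d(window, filters=32, kernel_size=3,
                               activation=tf.nn.relu, name='conv1')
        # ... many more conv/pool layers in the real model ...
        return tf.reduce_max(net, axis=1)            # [20, 32] per-window features

    def model(audio_batch, num_tags=50):
        # audio_batch: [20, 12, 38832, 1] -> [12, 20, 38832, 1]
        windows_first = tf.transpose(audio_batch, [1, 0, 2, 3])

        # map_fn traces window_cnn once, so the conv kernels are shared
        # across the 12 windows.
        per_window = tf.map_fn(window_cnn, windows_first)   # [12, 20, 32]

        # "Superpool": combine the 12 per-window outputs with a small network.
        pooled = tf.reduce_mean(per_window, axis=0)          # [20, 32]
        return tf.layers.dense(pooled, num_tags)             # [20, 50]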

[Image: diagram of the model architecture]

As a starting point I am using a setup similar to the Census TensorFlow example.

First and foremost, I am not sure this is the best option, since for evaluation a separate graph is built rather than sharing variables. This would require double the number of parameters.
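
On the shared-variables point, one way to avoid a second copy of the parameters would be to build the evaluation tower in the same graph and reuse the variables, e.g. (a sketch, reusing the model function from the sketch above; the placeholders merely stand in for the two input pipelines):

    import tensorflow as tf

    train_batch = tf.placeholder(tf.float32, [20, 12, 38832, 1])
    eval_batch = tf.placeholder(tf.float32, [20, 12, 38832, 1])

    with tf.variable_scope('net') as scope:
        train_logits = model(train_batch)   # creates the variables
        scope.reuse_variables()
        eval_logits = model(eval_batch)     # reuses them, no extra parameters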

Secondly, as a cluster setup I have been using one complex_l master, 3 complex_l workers and 3 large_model parameter servers. I do not know whether I am underestimating the amount of memory needed here.

The model previously trained fine as a much smaller network; increasing its size is what started the out-of-memory errors.

My questions are:

  1. The memory requirement is large, but I am fairly sure Cloud ML can handle it. Am I underestimating the amount of memory needed? What cluster would you suggest for a network like this?

  2. When using a tf.train.Server in the dispatch function, do you need to pass the cluster_spec on so it is used in the replica_device_setter, or does it allocate devices on its own? When I do not pass it and enable log_device_placement in tf.ConfigProto, all the variables seem to end up on the master worker. In the Census example's task.py it is not passed on; can I assume that is correct? (A minimal sketch of the pattern I mean is after this list.)

  3. How does one calculate how much memory is needed for a model (rough estimate to select the cluster)?

  4. Is there any other TensorFlow core tutorial on how to set up jobs this big (other than Census)?

  5. When training a big model with distributed between-graph replication, does the whole model need to fit on each worker, or does a worker only run ops and transmit the results to the PS? Does that mean the workers can get by with little memory, enough only for individual ops?
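
For question 2, the pattern I am asking about is roughly the following (a sketch; the cluster_spec here would be built from TF_CONFIG as shown further down):

    import tensorflow as tf

    def dispatch(cluster_spec, job_name, task_index):
        # One tf.train.Server per task; the spec tells each task about its peers.
        server = tf.train.Server(cluster_spec,
                                 job_name=job_name, task_index=task_index)
        if job_name == 'ps':
            server.join()   # parameter servers only host variables
            return

        # Without cluster=..., every variable stays on the local device; with it,
        # variables are spread round-robin over the ps tasks while the compute
        # ops stay on this worker/master.
        device_fn = tf.train.replica_device_setter(
            cluster=cluster_spec,
            worker_device='/job:%s/task:%d' % (job_name, task_index))
        with tf.device(device_fn):
            # Build the input queue, model, loss and train_op here, then run
            # them in a tf.train.MonitoredTrainingSession against server.target.
            pass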

PS: With smaller models the network trained successfully. I am trying to deepen the network for better ROC.

Error

Questions coming up from on-going troubleshooting:

When using the replica_device_setter with the cluster parameter, I noticed that the master has very little memory and CPU usage, and checking the device placement log there are very few ops on the master. I checked the TF_CONFIG that is loaded and it contains the following for the cluster field:

u'cluster': {u'ps': [u'ps-4da746af4e-0:2222'], u'worker': [u'worker-4da746af4e-0:2222'], u'master': [u'master-4da746af4e-0:2222']}

On the other hand, the tf.train.ClusterSpec documentation only shows ps and worker jobs, not a master. Does that mean the master is not considered a worker? What happens in that case?
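
For reference, this is roughly how the TF_CONFIG shown above can be turned into the cluster spec and task info (a sketch; as far as I understand, on Cloud ML the master is simply a third job name that also runs training ops, acting as the chief):

    import json
    import os

    import tensorflow as tf

    tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
    cluster = tf_config.get('cluster', {})   # has 'ps', 'worker' and 'master'
    task = tf_config.get('task', {})         # e.g. {'type': 'master', 'index': 0}

    # ClusterSpec accepts arbitrary job names, so 'master' is just another job;
    # it is not listed under 'worker' but it still runs the training ops.
    cluster_spec = tf.train.ClusterSpec(cluster)
    server = tf.train.Server(cluster_spec,
                             job_name=task.get('type'),
                             task_index=task.get('index', 0))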

Is the error actually a memory error, or something else (an EOF error)? [Image: screenshot of the error]

– Mark
  • Do you mind explaining what the dimensions of the input vector are (size=[batch, x, y, z], x=12, y=38832, z=1, what are x, y, and z)? – rhaertel80 Jul 22 '17 at 20:08
  • I want to wait to post a full answer until I have more info (see question above), but let me give a little info to start here in the comments. #2 is correct; the Census example has a bug that's been fixed but not pushed to github yet, so you do need to pass cluster_spec to the replica_device_setter. Note that every worker has their own queue, so the size of your queue is relevant here. How big is it? – rhaertel80 Jul 22 '17 at 20:14
  • I updated the question. Batch is 20, x=window number, y = song samples and z = channels for conv1d. – Mark Jul 22 '17 at 20:20
  • The queue size might be too big, at 10 batches. However, I have trained other models with the same batch size and the same dimensions of initial data; if the queue size were the problem it should have caused issues there too. The data was still [20, 12, 38832, 1]. – Mark Jul 22 '17 at 20:21
  • By passing the cluster_spec through (#2), you'll move the parameters off the workers, that should help. Then you can inspect the graph in TensorBoard to see the sizes of the variables to get an estimate for how much space is required for the parameter servers. The amount of data you can put on the workers is going to depend on your queue size and the size of the inputs (latter seems like it's only 9K, so probably a lot of headroom). Try passing through cluster_spec and see if that helps. – rhaertel80 Jul 22 '17 at 21:30
  • When using the cluster_spec, I still have memory problems. In TensorBoard the variables seem to be split across the ps servers as they should be. I am not sure how to estimate the memory needed, but I have a large number of tensors. I have two of those graphs (since I need a separate one for evaluation), and for each of them there are kernel parameters and gradients that all need to be kept. I think the gradients take a lot of space, since the stride is 1. It could also be that 3 large_model machines as parameter servers are not enough. – Mark Jul 22 '17 at 23:01
  • It might be that the transpose from [20, 12, 38832, 1] to [12, 20, 38832, 1] is using a large amount of memory. Then again, it worked with simpler models on the same input data. – Mark Jul 22 '17 at 23:14
  • I added the error that is showing up to the question. – Mark Jul 22 '17 at 23:45
  • To estimate the amount of RAM, you look at the size of the tensors and variables and add them together. The variables for deep layers are generally small, e.g., 256*256*4 (bytes/float32) is 260K, so even a really deep model (say, 100 layers) would only be 26 MB. 20*12*38832 is ~10 MB per batch. This article (http://cs231n.github.io/convolutional-networks/) has info about computing the size of the weights of convolution kernels, and if you don't do parameter sharing, it's huge. – rhaertel80 Jul 23 '17 at 00:41
  • By that reasoning, the presented model doesn't use that much memory, since all kernels are 3x1, just with large depths, and the gradients should only double the amount taken by the kernels. I am using the tf.layers.conv1d function, and my assumption is that parameter sharing is done inside that function; it did, however, give me more memory problems with larger input. Is my assumption about parameter sharing in the function incorrect? If so, what is the suggested fix? – Mark Jul 23 '17 at 02:57
