
My main issue is: I have a 204 GB training TFRecord file containing 2 million images, and a 28 GB validation TFRecord file containing 302,900 images. One epoch takes 8 hours to train, so training would take 33 days. I want to speed this up by using multiple threads and shards, but I am a little confused about a couple of things.

In the tf.data.Dataset API there is a shard function. The documentation says the following about it:

Creates a Dataset that includes only 1/num_shards of this dataset.

This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.

When reading a single input file, you can skip elements as follows:

d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)

Important caveats:

Be sure to shard before you use any randomizing operator (such as shuffle). Generally it is best if the shard operator is used early in the dataset pipeline. For example, when reading from a set of TFRecord files, shard before converting the dataset to input samples. This avoids reading every file on every worker. The following is an example of an efficient sharding strategy within a complete pipeline:

d = Dataset.list_files(FLAGS.pattern)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.repeat()
d = d.interleave(tf.data.TFRecordDataset,
                 cycle_length=FLAGS.num_readers, block_length=1)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)

These are my questions:

1- Is there any relation between the number of TFRecord files and the number of shards? Does the number of shards (workers) depend on the number of CPUs you have, or on the number of TFRecord files? And how do I create the shards: just by setting the number of shards to a specific number, or do I need to split the data into multiple files first and then set a specific number of shards? Note that "number of workers" here refers to the number of shards.

2- What is the benefit of creating multiple TFRecord files? Some people say it lets you shuffle your records better, but with the shuffle method in the tf.data.Dataset API we don't need that; others say it is only to split your data into smaller parts. My question: do I need, as a first step, to split my TFRecord file into multiple files?

3- Now we come to num_threads in the map function (num_parallel_calls in newer versions of TensorFlow). Should it be the same as the number of shards you have? When I searched, I found some people saying that if you have 10 shards and 2 threads, each thread will take 5 shards.

4- What about the d.interleave function? I know how it works, as shown in the example above, but again I am missing the connection to num_threads and cycle_length, for example.

5- If I want to use multiple GPUs, should I use shards, as mentioned in the accepted comment here?

As a summary, I am confused about the relation between the number of TFRecord files, num_shards (workers), cycle_length, and num_threads (num_parallel_calls), and about the best setup to minimize training time in both cases (multiple GPUs and a single GPU).

W. Sam

1 Answer


I am a tf.data developer. Let's see if I can help answer your questions.

1) It sounds like you have one large file. To process it using multiple workers, it makes sense to split it into multiple smaller files so that the tf.data input pipeline can process different files on different workers (by virtue of applying the shard function over the list of filenames). A sketch of one way to do this split is shown after point 5 below.

2) If you do not split your single file into multiple files, each worker will have to read the entire file, consuming n times more IO bandwidth (where n is the number of workers).

3) The number of threads for the map transformation is independent of the number of shards. Each worker will process its shard using num_parallel_calls threads. In general, it is reasonable to set num_parallel_calls proportional to the number of cores available on the worker.

4) The purpose of the interleave transformation is to combine multiple datasets (e.g. read from different TFRecord files) into a single dataset. Given your use case, I don't think you need to use it.

5) If you want to feed multiple GPUs, I would recommend going with the first option listed in the comment you referenced, as it is the easiest. The shard-based solution requires creating multiple pipelines on each worker (one for each GPU).
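To illustrate points 1) and 2), here is a minimal sketch of splitting a single large TFRecord file into several smaller shard files. It uses the TF 1.x tf.python_io API; the input path, output naming pattern, and shard count are hypothetical and should be adapted to your setup.

import tensorflow as tf

INPUT_FILE = "train.tfrecord"  # hypothetical path to the single large file
NUM_SHARDS = 10                # hypothetical; pick based on workers/cores

# One TFRecordWriter per output shard file.
writers = [tf.python_io.TFRecordWriter(
               "train-%05d-of-%05d.tfrecord" % (i, NUM_SHARDS))
           for i in range(NUM_SHARDS)]

# tf_record_iterator yields each serialized record; distribute the
# records round-robin across the shard files.
for i, record in enumerate(tf.python_io.tf_record_iterator(INPUT_FILE)):
    writers[i % NUM_SHARDS].write(record)

for w in writers:
    w.close()

Each worker can then apply shard over the list of these filenames, as in the documentation example you quoted, so it only ever opens the files in its own shard.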

To minimize training time (for single or multiple GPUs), you want to make sure that your input pipeline is producing data faster than the GPU(s) can process it. Performance tuning of an input pipeline is discussed here.
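As a rough illustration of point 3) and of keeping the GPU fed, here is a sketch of a single-worker input pipeline over the shard files produced above. The parser_fn here is a placeholder for your real parsing/decoding logic, and the cycle length, buffer sizes, and batch size are hypothetical starting points rather than tuned values.

import tensorflow as tf

NUM_CORES = 12  # e.g. the number of CPU cores on the machine

def parser_fn(serialized_example):
    # Placeholder; replace with your real tf.parse_single_example / decode logic.
    return serialized_example

filenames = tf.data.Dataset.list_files("train-*-of-*.tfrecord")
# Read several shard files concurrently and mix their records.
d = filenames.interleave(tf.data.TFRecordDataset,
                         cycle_length=4, block_length=16)
d = d.shuffle(buffer_size=10000)
# Decode/augment on multiple CPU threads.
d = d.map(parser_fn, num_parallel_calls=NUM_CORES)
d = d.batch(32)
# Prepare the next batch on the CPU while the GPU trains on the current one.
d = d.prefetch(1)

In graph mode you would then create an iterator (e.g. d.make_one_shot_iterator()) and feed its get_next() output to the model.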

jsimsa
  • So what is the best way to split? 1) Split the original dataset (which is in a CSV file) into 10 parts and create 10 TFRecord files, or 2) split the single created TFRecord file into multiple shards through tf.data.Dataset.shard? If the second option is better, how should I deal with the sharded dataset in Keras? Should I create 10 iterators (one per shard)? I mean, I will not be able to save 10 TFRecord files like in option one; I will have one TFRecord file and can only get one shard at a time. – W. Sam Aug 05 '18 at 17:16
  • Can you please help me with this post? It is also related to tf.data: https://stackoverflow.com/questions/51775619/best-input-pipeline-for-large-dataset – W. Sam Aug 09 '18 at 23:20
  • The optimal way to split depends on your hardware environment and the computation done by your pipeline. How many workers do you have, how many CPUs each, how much RAM, where is your data stored, and what is the interconnect between the workers and the data storage? Does `parser_fn` do any non-trivial computation? – jsimsa Aug 13 '18 at 16:44
  • My files are stored on a hard disk on the server. I have 12 CPU cores and 46 GB of RAM. I set the number of workers to zero, because that is the only way my data generator works. The comment by @kmader in this post, https://stackoverflow.com/questions/46135499/how-to-properly-combine-tensorflows-dataset-api-and-keras, mentions the same thing: the number of workers should be zero if you use your own data generator. – W. Sam Aug 14 '18 at 01:32
  • So the data is stored locally and you only have one worker? – jsimsa Aug 16 '18 at 17:58
  • I am confused because your initial question includes `FLAGS.num_workers` and now you are telling me you have no workers. – jsimsa Aug 16 '18 at 22:07
  • I am referring to the https://stackoverflow.com/questions/51775619/best-input-pipeline-for-large-dataset post. Sorry for the confusion; you answered me both here and there. – W. Sam Aug 16 '18 at 22:13
  • The link above is 404. – Ray Tayek Nov 28 '19 at 05:55