
What are the best practices for converting large datasets? In many of the cases I deal with, there is a first step where the input dataset is converted to a format consumable by the training (I deal with thousands of images). The conversion script was naively written to work locally (input directory -> output directory), and we run it inside an estimator (blob storage -> blob storage). Based on the guidelines here https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-with-datasets#mount-vs-download it looks like it is better to download and then upload rather than mount, am I correct? Apart from that, are there any guidelines for parallel or distributed processing?
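
For reference, this is roughly the download-then-upload pattern I have in mind inside the estimator script (the input name, datastore name, and paths below are placeholders for my setup):

```python
from azureml.core import Run, Datastore

# Inside the script run on the compute target
run = Run.get_context()
ws = run.experiment.workspace

# Download the input FileDataset (passed as a named input) to local disk
input_ds = run.input_datasets["raw_images"]            # placeholder input name
downloaded_files = input_ds.download(target_path="/tmp/raw", overwrite=True)

# ... existing local conversion logic: /tmp/raw -> /tmp/converted ...

# Upload the converted output back to blob storage
datastore = Datastore.get(ws, "workspaceblobstore")    # placeholder datastore name
datastore.upload(src_dir="/tmp/converted",
                 target_path="datasets/converted-images",
                 overwrite=True,
                 show_progress=False)
```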

Looking at this post: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-data-ingest-adf, it looks like they suggest using Azure Batch for custom parallel processing. If so, what is the advantage of using ADF? Why not use an AML pipeline with a first stage that runs the batch processing?

user9427997

1 Answer


For dataset mount vs. download: if you are processing all of the data in your dataset, download will perform better than mount. For parallel processing, there is a pipeline step specialized for it (ParallelRunStep): https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines/parallel-run
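
As a rough sketch of how that step could be wired into an AML pipeline for your conversion stage (the workspace `ws`, `compute_target`, `input_dataset`, environment name, mini-batch size, and entry script below are placeholders you would replace with your own):

```python
from azureml.core import Environment, Experiment
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Output location for the converted images
converted = OutputFileDatasetConfig(name="converted_images")

# Each worker process receives mini-batches of input files to convert
parallel_config = ParallelRunConfig(
    source_directory="scripts",            # folder containing the conversion script
    entry_script="convert.py",             # must define init() and run(mini_batch)
    mini_batch_size="10",                  # number of files per mini-batch (FileDataset)
    error_threshold=10,
    output_action="summary_only",
    environment=Environment.get(ws, "my-conversion-env"),  # placeholder environment
    compute_target=compute_target,         # an existing AmlCompute cluster
    node_count=4,
    process_count_per_node=2,
)

convert_step = ParallelRunStep(
    name="convert-images",
    parallel_run_config=parallel_config,
    inputs=[input_dataset.as_named_input("raw_images")],
    output=converted,
)

pipeline = Pipeline(workspace=ws, steps=[convert_step])
run = Experiment(ws, "dataset-conversion").submit(pipeline)
```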

When to use ADF vs. AzureML for data ingestion:
Here is an article that describes the pros and cons of these two approaches. You can use it to evaluate based on your scenario and needs.

May Hu
  • I ran a few experiments with FileDataset and I can confirm that, as far as performance is concerned, download is much faster than mount. – Arnab Biswas Feb 09 '22 at 10:48