
What are the guidelines, or where can I find guidelines, for designing a system for optimal parallelism? I understand that the data is split out across the individual nodes and optimized for that.

The data in my files currently contains multiple customers, sites, products, and users. I need to aggregate by customer, site, and product, which means that subsets of that data can be calculated independently on individual nodes and then brought back to a single node for output at the end of processing.
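For concreteness, the shape of the job is roughly the following. (The schema, paths, and the two sample aggregates here are placeholders, not the real calculations.)

    // Hypothetical schema; the real files contain more columns.
    @input =
        EXTRACT Customer string,
                Site string,
                Product string,
                Amount double
        FROM "/input/monthly_data.csv"
        USING Extractors.Csv();

    // Every calculation groups on the same key, so each
    // (customer, site, product) subset could in principle be
    // computed independently on its own node.
    @agg =
        SELECT Customer,
               Site,
               Product,
               SUM(Amount) AS TotalAmount,
               AVG(Amount) AS AvgAmount
        FROM @input
        GROUP BY Customer, Site, Product;

    OUTPUT @agg
    TO "/output/aggregates.csv"
    USING Outputters.Csv();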

However, I am not seeing that level of parallelism in the job graph. It is showing MDOP (maximum degree of parallelism), but not in a way that seems optimal. I have 4 different calculations that are done independently per customer, site, and product. The job is parallelizing the 4 calculations, but running each one over the entire dataset. In reality it should be able to fan the work out so that, say, 10 nodes each get 1 customer, and each of those nodes then fans its calculations out to 4 more nodes. (The numbers here are just for example; the scale of the data is much larger.)

How do I optimize either the layout of the files or the logic of the U-SQL to encourage more MDOP?

1 Answer


Is your data coming from unstructured files or from U-SQL tables? How much data are you processing? (You need more than 250 MB in a non-partitioned file to get parallelism.)

If the data comes from files, you may be able to partition the files and use file sets and ROWCOUNT hints to get better parallelism.
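As a minimal sketch (the paths, columns, and row estimate below are made up for illustration): a file set turns parts of the input path into virtual columns so the extraction can fan out across many files, and a ROWCOUNT hint tells the optimizer how many rows to expect so it can size the plan accordingly.

    // File set: {Month} is a virtual column derived from the path,
    // and each matching file can be read by its own vertex.
    @input =
        EXTRACT Customer string,
                Site string,
                Product string,
                Amount double,
                Month DateTime  // virtual column bound from the path
        FROM "/input/{Month:yyyy}/{Month:MM}/data.csv"
        USING Extractors.Csv();

    // ROWCOUNT hint: an estimate of how many rows this expression
    // produces, which the optimizer uses when planning parallelism.
    @filtered =
        SELECT Customer, Site, Product, Amount
        FROM @input
        WHERE Month >= DateTime.Parse("2016-01-01")
        OPTION(ROWCOUNT = 50000);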

If the data is stored in U-SQL tables, you can use table partitioning and column clustering to influence the parallelization.
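For example, a partitioned, clustered table could look like the following sketch (the table name, columns, and integer month key are illustrative, not from your job):

    // Partition by month so each month lands in its own partition,
    // and cluster/distribute on the aggregation key so rows for the
    // same customer are colocated.
    CREATE TABLE IF NOT EXISTS dbo.MonthlyData
    (
        Customer string,
        Site string,
        Product string,
        Amount double,
        MonthId int,  // e.g. 201601
        INDEX idx_customer
        CLUSTERED (Customer, Site, Product)
        PARTITIONED BY (MonthId)
        DISTRIBUTED BY HASH (Customer)
    );

    // Partitions must be added before data can be inserted into them.
    ALTER TABLE dbo.MonthlyData ADD IF NOT EXISTS PARTITION (201601);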

Also, at what level are you looking at the parallelization? Note that the job graph shows super vertices (SV), each of which represents a group of vertices that do the same work. Each vertex within a super vertex (if there is enough data) will execute in parallel on a slice of the data, and - if possible - the data will be passed to other vertices with minimal shuffling.

Michael Rys
  • Yes, they are unstructured files. We currently have 89 files per month, with 15 months of files = 1,335 files, which are between 48 and 227 MB each. We could combine those CSVs into single files per month if that would increase efficiency. – Marcus.Kellermann Mar 28 '16 at 20:31
  • So somewhere around 1.5 billion records. You mention partitioning the files. Are you implying that we should put groups of companies into individual files? Can you point to any information on file sets and ROWCOUNT hints? Although it sounds like loading this into a partitioned table might be the best solution. – Marcus.Kellermann Mar 28 '16 at 20:46
  • File sets are covered in the U-SQL documentation at http://aka.ms/usql_reference, although it seems you are already partitioning the data into files. You can add a hint at the end of a query expression, e.g., SELECT * FROM @rs WHERE predicate OPTION(ROWCOUNT=50000); But using a partitioned table may be a better approach. – Michael Rys Mar 30 '16 at 18:59
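Putting the comments together, here is a hedged sketch of loading the monthly CSV files into a partitioned table through a file set. (The paths, the month encoding, and the dbo.MonthlyData table are assumptions carried over from the illustrative table sketch in the answer above.)

    // {Month} and {FileName} are virtual columns recovered from the
    // path, e.g. "/input/201601/data_03.csv".
    @input =
        EXTRACT Customer string,
                Site string,
                Product string,
                Amount double,
                Month string,     // virtual column: the month directory
                FileName string   // virtual column: one value per file
        FROM "/input/{Month}/{FileName}.csv"
        USING Extractors.Csv();

    // The target partition must exist before the insert.
    ALTER TABLE dbo.MonthlyData ADD IF NOT EXISTS PARTITION (201601);

    // A predicate on the virtual column also prunes the file list,
    // so only that month's files are read at all.
    INSERT INTO dbo.MonthlyData
    SELECT Customer,
           Site,
           Product,
           Amount,
           Int32.Parse(Month) AS MonthId
    FROM @input
    WHERE Month == "201601";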