Can i have any books about Azure Data Lake Internals?

Question

I dont wanna use the ADL and ADLA as a black box. I need to understand how the gears rotate underhood to use it in an efficient way.

Where i can find an information that describe internals:

how U-SQL query is processed
how parallelism is worked
how storage is organized in ADL at low level
how DB's storage is organized in ADL at low level (is it rowstore or columnstore)
how partitioning is organized
etc

There is exists a lot of books and whitepappers that describes RDBMS engine's internals. Does it exists for ADL/ADLA?

There are a lot of guys who works in Azure. Could you publish any drafts/whitepappers to use as is (unoficially).

score 7 · Accepted Answer · answered Feb 24 '17 at 11:33

7

Some of that information is available in presentations we have given. For example you can find some of these presentations on my slideshare account at: http://www.slideshare.net/MichaelRys.

To answer some of your questions above:

The current clustered index version of U-SQL tables are stored in your catalog folder structured as so called structured stream files. These are highly compressible, scaled out files that use a row-oriented structure with self-contained meta data and statistics (more detailed stats can be created). The table construct provides 2 level partitioning: addressable partitions and internal distribution schemes (HASH, RANGE etc). Both help with parallelization, although distribution schemes are more for performance while partition more for data lifecycle management. There is no limit on them, although the sweet spot is 1GB to 4GB per distribution bucket.

1 AU is basically 1 container. And ADLS is NOT HDFS architecturally but offers the WebHDFS API for compatibility.

answered Feb 24 '17 at 11:33

Michael Rys

6,684
15
23

big thanx again for awesome explanation! – churupaha Feb 24 '17 at 13:08
1

looks like ADL team is going to implement secondary indexes and columnstore support? I read it in the presentation https://www.slideshare.net/mobile/MichaelRys/tuning-and-optimizing-usql-queries-sqlpass-2016 Do you know when they plan to implement it? – churupaha Feb 27 '17 at 07:12
That is future roadmap and we currently do not have ETAs for these items. – Michael Rys Feb 27 '17 at 23:31
@MichaelRys "the sweet spot is 1GB to 4GB per distribution bucket", do you mean for the datalake anaytics table partition as well? If I need write/read huge amount of data into 1 partition, will it slow down the performance? What's the partition means in the background, does it means save on 1 node/machine? – lixinso Jun 21 '17 at 21:51
1

Addressable partitions (created with `PARTITIONED BY`) are mapped to files, so they can be much larger and we actually recommend to make them as large as you can for performance. – Michael Rys Jun 21 '17 at 23:35
@MichaelRys Thanks a lot for the details. One more question is, once I created thousand partitions(I try to partition one day data for one partition, for years), when I run a query job that need go through the table(without specify which partition data to use), the prepare phase is pretty slow, sometimes say failed(I remember something like timeout after 10+ minutes). Is there any limitations on how may partitions we can create? – lixinso Jun 22 '17 at 19:46
There is no limit on the creation. But currently there is a 3000 file limit (which maps to 3000 partitions if you only insert once per partition) when querying. Please file/upvote a request on http://aka.ms/adlfeedback and/or send me mail if you feel that is too low (we have an item on our backlog to address that, but customer evidence can help in prioritizing it). – Michael Rys Jun 23 '17 at 17:35

score 1 · Answer 2 · answered Feb 23 '17 at 19:42

1

This is a pretty broad question. I assume you've started with the existing documentation on ADLA and U-SQL? https://learn.microsoft.com/en-us/azure/data-lake-analytics/ https://msdn.microsoft.com/library/azure/mt591959

ADLA GA'd in November of 2016, compared to SQL Server in 1987 - that's a very apples and oranges comparison.

Maybe we can start with your specific questions?

answered Feb 23 '17 at 19:42

guyhay_MSFT

631
4
7

Yes sql server is very mature. But it have a lot of new festures. And we can read deep explanations from tech guys how these features wworks underhod before these features are released. Ok lets start from a concrete question. Can you explain, what is the u-sql table? Is it rowstore or columnstore? What is partition from all points of view (parallelizm, filtering, manageability etc). What is distribution? Is it the same as in the case of Azure DWH? How partitioning and distribution affect parallelizm. How big can be table/partition/distribution? – churupaha Feb 23 '17 at 21:46
What about skewed distributions from parallelizm point of view. What about statistics on columns? How usql engine choose a concrete execution plan (i mean "graph" with vertexes)... Is clustered index the same as in sql server? – churupaha Feb 23 '17 at 21:54
Is an Analytics Unit = YARN Container underhood? – churupaha Feb 24 '17 at 05:14
Is ADL = HDFS underhood or it is something else, that implement HDFS interface for compatiility with hadoop-world? – churupaha Feb 24 '17 at 05:17
@guyhay_MSFT you had several years, still no answer? – Alex Gordon Sep 06 '19 at 03:50
@churupaha i'm curious to know whether you adopted ADL and are using it now and if not why not – Alex Gordon Sep 06 '19 at 03:50

Can i have any books about Azure Data Lake Internals?

2 Answers2