Partition By & Clustered & Distributed By in USql - Need to know their meaning and when to use them

Question

I can see that while creating table in USQL we can use Partition By & Clustered & Distributed By clauses.

As per my understanding partition will store data of same key (on which we have partition) together or closer (may be in same structured stream at background), so that our query will be more faster when we use that key in joins, filter.

Clustering is - I guess it stores data of those columns together or closer inside each partition.

And Distribution is some method like Hash or Round Robin - the way of storing data inside each partition. If you have integer column and you frequently query within some range , use range else use hash. If your data is not distributed equally then you may face data skew issue, so in that case use round robin.

Question 2: Please let me know whether my understanding is correct or not?

Question 1: There is INTO clause - I want to know how we should identify value for this INTO clause for DISTRIBUTION?

Question 3: Also want to know that which one is vertical partitioning and which one is horizontal?

Question 4: I don't see any good online document to learn these concepts with examples. If you know please send me links.

This is way too broad for StackOverflow. Try http://www.sqlservercentral.com/stairway/142480/ — Peter Bons, Nov 23 '17 at 12:47
[This document](https://msdn.microsoft.com/en-us/library/mt706196.aspx) has all the info you are asking for. — wBob, Nov 23 '17 at 15:50

score 2 · Answer 1 · answered Nov 29 '17 at 22:36

Peter and Bob have given you links to documentation.

To quickly answer your questions here:

Partitions and distributions both partition the data based on the partitioning scheme and both provide data scale out and partition elimination.

Partitions are optional and individually manageable for data life cycle management (besides giving you the ability to get partition elimination) and currently only support a value-based partition based on the same column values.

Each Partition then gets further partitioned based on the distribution scheme. Here you have different schemes (HASH, RANGE etc). The system decides on the number of distribution buckets based on some heuristic. In the case of HASH partitions, you can also specify the number of buckets with the INTO clause.

The clustering will then specify the order of the rows within a distribution bucket and allows you to further improve query performance (you can to a range scan instead of a full scan for example).

Vertical and horizontal partitioning are terms sometimes used to separate these two levels of partitioning. I try to avoid it, since it can be confusing to remember which one is which.

Partition By & Clustered & Distributed By in USql - Need to know their meaning and when to use them

1 Answers1

Linked