
I have been reviewing the AWS documentation and cannot seem to understand how the distribution style works and how data is stored on Redshift. I understand what a columnar storage database is, but the documentation on Redshift distribution styles confuses me as to how the data is stored on the nodes. The distribution style is described as distributing newly loaded data by rows to the slices of the compute nodes.

For example, EVEN distribution style is defined as:

Even distribution

The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column. EVEN distribution is appropriate when a table does not participate in joins or when there is not a clear choice between KEY distribution and ALL distribution. EVEN distribution is the default distribution style.

So how exactly does the data get stored in columnar storage if it is being distributed by rows? Does the columnar storage come into effect after the data has been distributed to the compute nodes?

Here are the links to the AWS documentation discussing columnar storage and distribution styles:

stochasticcrap

1 Answer


Each Amazon Redshift cluster has multiple nodes. Each node is divided into slices, with allocated CPU and disk storage.

Each column within a table is stored separately, so a table with 3 columns requires at least 3 blocks per slice. This is what makes Redshift columnar -- each column is stored separately.
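This struct-of-arrays idea can be sketched in a few lines. This is an illustration of columnar layout in general, not Redshift's internal block format; the table and column names are made up:

```python
# Sketch (not Redshift internals): each column is stored in its own
# independent array, analogous to each column getting its own blocks.
rows = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]

# Columnar layout: one storage area per column.
columnar = {col: [row[col] for row in rows] for col in ("id", "name", "amount")}

# Scanning a single column touches only that column's storage --
# the other columns are never read.
total = sum(columnar["amount"])
print(total)  # 60
```

Because each column is stored (and compressed) separately, a query that reads one column out of fifty never touches the other forty-nine.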

Each block is 1 MB in size and is independently compressed.

See: Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?

The Distribution Key determines which rows are stored on which slices. Remember -- each slice has its own storage for each column in a table, but the rows are distributed between slices. (Except for a Distribution of ALL, which puts every row into every node.)
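The row-assignment step can be sketched as follows. The slice count, hash function, and data are illustrative assumptions, not what Redshift actually uses internally:

```python
# Sketch of how a leader node might assign incoming rows to slices.
import zlib

NUM_SLICES = 4
slices = {i: [] for i in range(NUM_SLICES)}

rows = [{"customer_id": cid, "amount": cid * 10} for cid in range(1, 9)]

# KEY distribution: hash the distribution-key value, so rows with equal
# keys always land on the same slice (useful for co-located joins).
for row in rows:
    slice_id = zlib.crc32(str(row["customer_id"]).encode()) % NUM_SLICES
    slices[slice_id].append(row)

# EVEN distribution would instead assign round-robin, ignoring values:
#   slice_id = row_number % NUM_SLICES
```

Either way, the unit being routed is a whole row; only once a row arrives at its slice is it split into per-column storage.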

Within the storage for a particular column on a slice, the data is sorted based upon the Sort Key.
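The payoff of sorting can be sketched with per-block min/max metadata (Redshift calls these zone maps). The block size and values below are illustrative; real blocks are 1 MB, not four values:

```python
# Sketch: within a slice, a column's values are kept in sort-key order,
# and each block records its min/max so whole blocks can be skipped.
BLOCK_SIZE = 4  # illustrative; real Redshift blocks are 1 MB

values = sorted([17, 3, 42, 8, 25, 11, 36, 5])  # sorted by sort key
blocks = [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]
zone_map = [(min(b), max(b)) for b in blocks]

# A query like "WHERE col BETWEEN 30 AND 40" only reads blocks whose
# min/max range overlaps the predicate; the rest are pruned.
needed = [b for b, (lo, hi) in zip(blocks, zone_map) if lo <= 40 and hi >= 30]
```

Here only the second block overlaps the predicate range, so the first block is never read from disk.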

John Rotenstein
  • This makes sense now. The key piece of info was "each slice has its own storage for each column in a table, but the rows are distributed between slices". So if there were two tables with 3 columns each, then each slice would need at least 6 blocks, correct? – stochasticcrap Oct 19 '17 at 23:56
  • Exactly! It's not efficient for a tiny table, but data warehouses often have billions of rows in a table and the most efficient method is to distribute the data for parallel processing. – John Rotenstein Oct 20 '17 at 02:29