
I am currently using AWS Redshift service to store data. The data size is about to hit 100% of disk space.

  1. Will adding nodes and changing from Single-node to Multi-nodes increase the disk size?

  2. Is moving from dc1.xlarge to bigger nodes such as dc1.8xlarge the only way to increase the disk space?

  3. If I move to Multi-nodes, will the data be split or just mirrored so that both nodes will have the same data?

Aung Myint Thein

1 Answer


Redshift is a distributed columnar data warehouse solution. The key here is "distributed". Unlike traditional databases, Redshift is designed to scale out by adding nodes to the cluster. Adding nodes adds disk space as well as computing horsepower. To answer your questions:

  1. Will adding nodes and changing from Single-node to Multi-nodes increase the disk size?

    Generally speaking, yes. When storing data in Redshift, you should choose a distribution key (a column or set of columns) that will spread your data evenly across the nodes. As a general principle, use the same distribution key columns across tables that you frequently join, so that matching rows are co-located on the same node. Note that tables configured with a distribution style of ALL are replicated in full on every node, so they add no capacity as the cluster grows; limit DISTSTYLE ALL to small dimension tables.

  2. Is moving from dc1.xlarge to bigger nodes such as dc1.8xlarge the only way to increase the disk space?

    No; see the answer to question 1 above. There are different node types to choose from depending on your requirements: DC1 nodes are compute optimized, with smaller but faster SSD drives, while DS1 (dense storage) nodes provide significantly more disk space per node.

  3. If I move to Multi-nodes, will the data be split or just mirrored so that both nodes will have the same data?

    See the answer to question 1 above: when you add nodes to your Redshift cluster, Redshift redistributes your data across all nodes according to the distribution style of each of your tables.
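
As an illustration of the distribution styles mentioned above (the table and column names here are hypothetical), a large fact table might hash-distribute on a join key while a small dimension table is replicated to every node:

```sql
-- Hypothetical fact table: rows are hashed on the DISTKEY column, so
-- all rows for a given domain land on the same node, and joins on
-- domain avoid shuffling data across the network. Adding nodes grows
-- both the storage and the compute available to this table.
CREATE TABLE page_views (
    domain    VARCHAR(255),
    user_id   BIGINT,
    viewed_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (domain);

-- Hypothetical dimension table: DISTSTYLE ALL keeps a full copy on
-- every node, which speeds up joins but consumes capacity on each
-- node -- use it only for small lookup tables.
CREATE TABLE domains (
    domain   VARCHAR(255),
    category VARCHAR(64)
)
DISTSTYLE ALL;
```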

PS: I would highly recommend reading through the Redshift documentation. Start with Are You a First-Time Amazon Redshift User?
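
Before and after resizing, it is also worth checking how full the disks are and whether your distribution keys actually spread rows evenly. A sketch using Redshift's standard system views (stv_partitions and svv_table_info; column meanings assumed from the system-table documentation):

```sql
-- Disk usage per node: 'used' and 'capacity' are counts of 1 MB blocks.
SELECT owner AS node,
       SUM(used)     AS used_blocks,
       SUM(capacity) AS total_blocks
FROM stv_partitions
GROUP BY owner
ORDER BY owner;

-- Distribution style and skew per table: a high skew_rows value means
-- one node holds far more of the table's rows than the others, i.e.
-- the distribution key is not spreading the data evenly.
SELECT "table", diststyle, skew_rows, pct_used
FROM svv_table_info
ORDER BY skew_rows DESC;
```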

References: Choosing a Data Distribution Style

DotThoughts
  • Thanks! This really answers a lot of my questions. Recently I tried using a distribution key (e.g. domain) on a big table. When I EXPLAIN a "select * from table_name" query, the table with the distribution key shows a larger width than the normal table. Do you have any idea why? – Aung Myint Thein Jan 20 '17 at 15:19
  • That's what Redshift estimates as the average size of a row. Try running ANALYZE on the table and see if the width changes to something more reasonable. – DotThoughts Jan 24 '17 at 23:09
  • Wanted to update this answer a bit for 2023: a dc2.large now has 160 GB per node, whereas an ra3.xlplus gives you 4 TB as a starting point, so different node types yield very different storage amounts. – Justin Fortier May 30 '23 at 15:55