Is there a particular reason Redshift doesn't allow for key constraints? Check out the statement below, from the Amazon Redshift documentation:

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints, and they should be declared if your ETL process or some other process in your application enforces their integrity.

Is this due to speed or something? There must be a reason here!
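
To make the quoted behavior concrete, here's a minimal sketch (the table and column names are made up for illustration). Redshift accepts the constraint declaration but never checks it:

    -- Redshift parses and stores the PRIMARY KEY declaration
    -- (the planner can use it as a hint)...
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    VARCHAR(64)
    );

    -- ...but nothing enforces it: both inserts succeed, leaving
    -- two rows with user_id = 1 and no error raised.
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO users VALUES (1, 'bob');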

Mark Rotteveel
  • There is a long list of things that Redshift does not support, although the underlying code base (Postgres) does. Most are simply more cumbersome to implement in a columnar database than in a more traditional row-oriented one. – Gordon Linoff Mar 02 '16 at 21:23
  • Is that the only reason? Just because it would be difficult to implement? –  Mar 02 '16 at 21:27
  • Perhaps more precisely: Redshift doesn't support indexes. Although columnar databases can support indexes, there are good reasons why they are less effective, so not necessarily worth implementing. No indexes means no unique indexes. No unique indexes means no unique constraints. No uniqueness, no foreign keys, and no easy implementation of primary keys. – Gordon Linoff Mar 02 '16 at 21:32
  • Hm. I believe that Redshift does indeed support indexes. http://stackoverflow.com/questions/32088161/how-to-create-an-index-in-amazon-redshift It doesn't support key constraints though (like UNIQUE) –  Mar 02 '16 at 21:42
  • I find it difficult to believe Amazon didn't implement this because "it was too hard". There must be a computational reason due to Redshift's architecture. –  Mar 02 '16 at 21:49
  • The documentation is pretty clear that Redshift does *not* support indexes: http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-features.html (a sketch of the contrast follows below). – Gordon Linoff Mar 03 '16 at 03:48
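
To make the distinction drawn in the comments above concrete, here's a minimal sketch (hypothetical identifiers): Redshift rejects index DDL outright, while constraint DDL is accepted but never enforced.

    -- Rejected: CREATE INDEX is on Redshift's list of unsupported
    -- PostgreSQL features (see the docs link above).
    CREATE INDEX users_name_idx ON users (name);

    -- Accepted, but informational only: no unique index exists
    -- behind it, so uniqueness is never actually checked.
    ALTER TABLE users ADD CONSTRAINT users_name_uniq UNIQUE (name);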

1 Answer

I think the main reason is that checking uniqueness is not realistic from the standpoint of data-loading performance. Since Amazon Redshift's architecture is designed to process data in parallel for scale-out, loaded data is distributed across multiple compute nodes. To enforce those constraints, Redshift would therefore need to check uniqueness across nodes for each row, which would be significantly slow due to the I/O involved.
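
Since nothing enforces the constraint, detecting violations falls to the user. Here is a minimal sketch (against the hypothetical users table above) of the kind of cluster-wide aggregation that enforcement would effectively require on every load:

    -- Finding duplicates of an informational primary key means
    -- aggregating every row; on a multi-node cluster the rows for
    -- a given key may live on different nodes, so the data must be
    -- redistributed first, which is the I/O cost described above.
    SELECT user_id, COUNT(*) AS copies
    FROM users
    GROUP BY user_id
    HAVING COUNT(*) > 1;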

Masashi M
  • Hmm, I disagree: build a hashmap and check whether the key already exists. –  Mar 03 '16 at 18:48
  • Yes, using a hashmap would work if there were just one node. But what about the case where there are multiple nodes, say 16 compute nodes? In that case, Redshift could not load the data in parallel with the COPY command, and it would also need to transfer the loaded data to all of the nodes to check uniqueness. – Masashi M Mar 03 '16 at 19:29
  • How do people then deal with these constraints if Redshift doesn't support them? –  Mar 04 '16 at 19:39
  • There are a few ways to avoid duplicate rows: 1. use an upsert approach on data load, 2. check the load commit history on COPY, 3. periodically check for and remove duplicate rows (a sketch of the first approach follows below). – Masashi M Mar 15 '16 at 22:51
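
A common shape for the first approach is Redshift's staging-table merge pattern. Here is a minimal sketch, assuming a users target table keyed on user_id; the S3 path and IAM role are placeholders:

    -- Approach 1 (upsert on load): stage the incoming batch,
    -- replace matching keys in the target, commit atomically.
    BEGIN;

    CREATE TEMP TABLE users_stage (LIKE users);

    COPY users_stage
    FROM 's3://my-bucket/users/batch1'              -- placeholder path
    IAM_ROLE 'arn:aws:iam::123456789012:role/load'  -- placeholder role
    FORMAT AS CSV;

    -- Delete target rows that the new batch supersedes...
    DELETE FROM users
    USING users_stage
    WHERE users.user_id = users_stage.user_id;

    -- ...then insert the batch, so each key appears exactly once
    -- (assuming the batch itself carries no duplicates).
    INSERT INTO users SELECT * FROM users_stage;

    COMMIT;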