7

I'm getting my first exposure to data warehousing, and I’m wondering is it necessary to have foreign key constraints between facts and dimensions. Are there any major downsides for not having them? I’m currently working with a relational star schema. In traditional applications I’m used to having them, but I started to wonder if they were needed in this case. I’m currently working in a SQL Server 2005 environment.

UPDATE: For those interested I came across a poll asking the same question.

Jon Seigel
  • 12,251
  • 8
  • 58
  • 92
Garett
  • 16,632
  • 5
  • 55
  • 63
  • possible duplicate of [Is it good practice to have foreign keys in a datawarehouse (relationships)?](http://stackoverflow.com/questions/2690818/is-it-good-practice-to-have-foreign-keys-in-a-datawarehouse-relationships) –  Mar 21 '14 at 19:04

7 Answers7

16

Most data-warehouses (DW) do not have foreign keys implemented as constraints, because:

  • In general, foreign key constraint would trigger on: an insert into a fact table, any key-updates, and a delete from a dimension table.

  • During loading, indexes and constraints are dropped to speed-up the loading process, data integrity is enforced by the ETL application.

  • Once tables are loaded, DW is essentially read-only -- the constraint does not trigger on reads.

  • Any required indexes are re-built after the loading.

  • Deleting in a DW is a controlled process. Before deleting rows from dimensions, fact tables are queried for keys of rows to be deleted -- deleting is allowed only if those keys do not exists in any of fact tables.

Just in case, it is common to periodically run queries to detect orphan records in fact tables.

Damir Sudarevic
  • 21,891
  • 3
  • 47
  • 71
  • Thanks for the great feedback Damir. The fact that the system is essentially read only was what made me start to question the need for contraints. – Garett May 13 '10 at 14:35
  • 2
    Not sure if I should add here or to the similar question but...If integrity is an issue you can always right integrity functions or stored procedures that look for "orphaned" facts. (Rows where the foreign keys don't make sense). You can then clean those up after/during/before the next cycle of loads on your database. – Markus Oct 12 '10 at 18:23
  • "the constraint does not trigger on reads" While the potential to fail a constraint cannot happen on a read, because FK RI is typically implemented/enforced using an index, which can factor into the execution plan on reads - like all constraints, the optimizer can potentially use any constraints to influence the execution plan. – Cade Roux May 07 '12 at 14:18
  • Thanks Damir. Can you please provide an example for how data integrity can be enforced by an ETL application ? – Erran Morad May 18 '14 at 00:29
8

We use them, and we're happy with it.

Is it good practice to have foreign keys in a datawarehouse (relationships)?

There is overhead, but you can always disable the constraint during load and then re-enable it.

Having the constraint in place can catch ETL bugs and modelling defects.

Community
  • 1
  • 1
Cade Roux
  • 88,164
  • 40
  • 182
  • 265
3

I think in theory, you need that. But it depends on how you separate your data over database. If all of them in the same database, foreign key can help you because setting foreign key will help the database do selecting faster based on the indexing. If you share tables over many database, you need to check it on your application level

You can have your database check it for you but it can be slow. And generally, in data warehouse, we don't care about redundancy or integrity. We already have a lot of data and a few integrity and redundancy will not affect the general aggregate data

vodkhang
  • 18,639
  • 11
  • 76
  • 110
  • I mostly concur. Though I would have worded it that "Having Foreign Keys allows the database to choose the right index because it KNOWS that the relationships exist." So I think you should have them, but I don't think you need them. – MJB May 12 '10 at 14:06
  • Yeah, as I said, we should have them because of higher performance and integrity. But if we have to get rid of them, just going ahead:) – vodkhang May 12 '10 at 16:30
  • I agree that constraints (in general) can be exploited by optimizers to find better explain plans. That's the upshot of your answer. I disagree that FK's help the optimizer choose indexes. Can you point me to a resource which explains how that mechanism works? Or describe one where the optimizer would choose the wrong index without the FK but chooses the correct one with the FK in place? – Stephanie Page Sep 29 '11 at 19:37
  • http://explainextended.com/2009/10/15/constraints-and-the-optimizer-in-sql-server-foreign-key/ – Stephanie Page Sep 29 '11 at 19:45
2

I don't know about necessary, but I feel they are good for data integrity reasons. You want to make sure that your fact table is always pointing to a valid record in the dimension table. Even if you are sure this will happen, why not have the database validate the requirement for you?

jaltiere
  • 1,068
  • 6
  • 11
2

The reasons for using integrity constraints in a data warehouse are exactly the same as in any other database: to guarantee the integrity of the data. Assuming you and your users care about the data being accurate then you need some way of ensuring that it remains so and that business rules are being correctly applied.

nvogel
  • 24,981
  • 1
  • 44
  • 82
2

As far as I know FKs, speed up queries. Also, many BI solutions exploit them in their integration layer. So for me they are a must in DWs.

cs0815
  • 16,751
  • 45
  • 136
  • 299
1

Hope this thread is still active. My thinking is: for large fact tables with many dimensions and records, foreign keys will slow inserts and updates so that a fact table becomes too slow to load especially as it increases in size. Indexes are used for querying AFTER the table is loaded, so they can be disabled during inserts/updates and then rebuilt. The foreign key RELATION is important NOT the foreign key itself: this is really implicit in the ETL process. I have found that foreign keys make things TOO slow in the real world Datawarehouse. You need to use a VIRTUAL foreign key: the relation is their but not the constraint. If you damage the foreign key relations in a Datawarehouse you are doing something wrong. If you disable them during inserts and there is an mismatch or orphan, you won't be able to reenable them, so what's the point. The whole point of the DW is fast access and querying. Foreign keys make that impossible. Interesting debate: not easy to find this question on the Net Kev