1

AFAIK, in case of Relational Database on MPP hardware, the key to performance is a correct data distribution. While Dimensional Modeling is about query flexibility, you don't even know how the data will be queried (shuffled) in future.

For example, you have MPP Data Warehouse (Greenplum, Redshift, Synapse Analytics). For example, in 1-2 years, you expect your fact table will grow up to 10 billion of rows and you'll have 15-30 dimension tables of 10s millions of rows. How the data should be distributed accross DW nodes? Is there any common techniques? Like shard fact table and replicate dimension tables. Or should I minimize node amount in MPP DW?

I can bring specific use case, but I believe that the question arise from my misunderstanding of how Dimensional Modeling could be paired with scaling out.

VB_
  • 45,112
  • 42
  • 145
  • 293
  • is you r use case "compatible " with moving all of your dimension data in to your fact table? and having just 1 wide table? This can be the most efficient approach for some use cases on redhsift. – Jon Scott Dec 30 '19 at 08:12
  • @JonScott I think no since a) number of dimensions is pretty big, and fact table volume will grow significantly b) I have SCD Type 2 c) i will have multiple fact tables in the future, and want to have conformed dimensions, not sure how to achieve that with flat table – VB_ Dec 30 '19 at 09:34
  • That in theory, I haven't enough practice in this questions) – VB_ Dec 30 '19 at 09:58

1 Answers1

1

One technique I’ve seen applied with success in the past is: segment the fact table (e.g., by mod’ing the date key), and distribute all dimensions across all nodes. That way all joins can be done locally.

Note that even with large dimensions, their total size on disk should be a small fraction of the total needed for the fact table.

nsousa
  • 4,448
  • 1
  • 10
  • 15
  • thank you for your answer! What did you to when you was adding new nodes? Did you warehouse support some consistent hashing and automatic re-segmentation? Or that was done manually? – VB_ Dec 30 '19 at 09:37
  • it was done manually, iirc. But depending on your db of choicd you may have tools to do it automatically. – nsousa Dec 31 '19 at 10:06