2

I know that hive cannot create surrogate keys or is rather difficult. I want to understand how companies have implemented dimensional modeling in their warehouse.

One way I can think of is leaving the dimension details as is in fact. Then move the distinct of dimension to a different table. But then how are scd1 and scd2 handled. I have checked talks by Kimball on cloudera and I still don't understand how this works.

Ravi
  • 1,811
  • 1
  • 18
  • 31
  • _how companies have implemented dimensional modeling in their warehouse_ In a non-hive warehouse, like SQL Server for example, you simply use an `identity`. – Nick.Mc Jul 13 '21 at 11:11

2 Answers2

0

There are two ways of handling this problem in Hive.

The first does not directly answer your question, and that is to use natural keys instead of surrogates. While surrogates are more convenient and performant, since you're using Hive I'm guessing that performance isn't one of your major criteria, so the cost of using natural keys will just be in the extra lines of code you have to write to cater for compound keys.

The second way is to use Hive's windowing functions to calculate the surrogate. I don't have a Hive environment handy to test this query, but the surrogate would look something like:

(select max(surrogate_key_column) from dimension_table)
+ row_number() over (order by 1)
Ron Dunn
  • 2,971
  • 20
  • 27
  • That can be done. But I would like to know if you people are implementing this in their company. Are you using the second approach in your work? Also using NK approach would not work if we are creating fact from multiple tables. NKs are going to overlap. – Ravi Aug 12 '18 at 10:51
  • I've not used that approach in Hive, but I've used it extensively in other data warehouse environments. I don't understand why multiple tables would prevent natural keys being used ... surely, for example, CustomerA in Table1 would be the same as CustomerA in Table2? If not, then the natural key should also include the concept represented by the separate table. If you can share some more details of your tables it would help to understand the problem better. – Ron Dunn Aug 13 '18 at 06:00
0

As far as I know, In version 3.0, Hive supports the surrogate keys on ACID tables

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_surrogate_keys.html

Summarised from the link:


The SURROGATE_KEY UDF generates a unique Id for every row that you insert into a table.

Example usage:

-Create a table

CREATE TABLE students_v2 
(`ID` BIGINT DEFAULT SURROGATE_KEY(),
 row_id INT,
 name VARCHAR(64), 
 dorm INT, 
 PRIMARY KEY (ID) DISABLE NOVALIDATE);

-Insert data, which automatically generates surrogate keys for the primary keys.

INSERT INTO students_v2 (row_id, name, dorm) SELECT * FROM students;

-Take a look at the surrogate keys.

SELECT * FROM students_v2;
Nick.Mc
  • 18,304
  • 6
  • 61
  • 91
QuangPH
  • 126
  • 7
  • 1
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/29381692) – Lin Du Jul 13 '21 at 01:14
  • Thank you for all your comment and edit, @slideshowp2, Nick. I will improve in my next answers. – QuangPH Aug 10 '21 at 03:54