Creating Star Schema (Dimensions and Fact Table) in Hadoop Hive from a single csv file

Question

Im completely new to Datawarehouse, OLAP and hive. I do have a single csv file containing training data about a online market like ebay (see Column Data). My task is to create a star schema in hive (via Data Analytics Studio, Spark, or whatever).

I scaffolded the dimension as followed:

How can i create those dimensions tables and especially generate a new ID for each line? My SELECT New_Guid(), listing_title, listing_subtitle, listing_type_code, start_price, buy_it_now_price, buy_it_now_listed_flag, qty_available_per_listing From auctions, which i found on another tutorial. But New_Guid() does not work at all in my Data Analytics Studio.

Thanks a lot!

How to get UUID or numerical surrogate key: https://stackoverflow.com/a/58625717/2700344 — leftjoin, Jul 11 '20 at 20:14

thebluephantom · Accepted Answer · 2020-07-12T07:37:49.737

Assuming no history in the dimensions, and leaving aside if good dimension design or not:

For each required Dimension:
- read the csv and extract relevant fields with distinct applied to temp_table
- add a sequence number to each row using select (row_number() over()), Col1, Col2, col3, col4 from temp_table & persist to dimension_table
For the fact table:
- read the csv and extract relevant fields with applied to temp_table
- add a sequence number to each row using select (row_number() over()), Col1, Col2, col3, col4 from temp_table and store in another temp_table_2
- JOIN with appropriate statement from temp_table_2 to dimensuions looking up / selecting the sequence number giving temp_table_3
- insert the keys only from temp_table_3 to fact_table

Creating Star Schema (Dimensions and Fact Table) in Hadoop Hive from a single csv file

1 Answers1