1

Im completely new to Datawarehouse, OLAP and hive. I do have a single csv file containing training data about a online market like ebay (see Column Data). My task is to create a star schema in hive (via Data Analytics Studio, Spark, or whatever).

I scaffolded the dimension as followed:

Fact and Dimensions

How can i create those dimensions tables and especially generate a new ID for each line? My SELECT New_Guid(), listing_title, listing_subtitle, listing_type_code, start_price, buy_it_now_price, buy_it_now_listed_flag, qty_available_per_listing From auctions, which i found on another tutorial. But New_Guid() does not work at all in my Data Analytics Studio.

Thanks a lot!

thebluephantom
  • 16,458
  • 8
  • 40
  • 83
Lizzal
  • 89
  • 8

1 Answers1

1

Assuming no history in the dimensions, and leaving aside if good dimension design or not:

  1. For each required Dimension:

    • read the csv and extract relevant fields with distinct applied to temp_table
    • add a sequence number to each row using select (row_number() over()), Col1, Col2, col3, col4 from temp_table & persist to dimension_table
  2. For the fact table:

    • read the csv and extract relevant fields with applied to temp_table
    • add a sequence number to each row using select (row_number() over()), Col1, Col2, col3, col4 from temp_table and store in another temp_table_2
    • JOIN with appropriate statement from temp_table_2 to dimensuions looking up / selecting the sequence number giving temp_table_3
    • insert the keys only from temp_table_3 to fact_table
thebluephantom
  • 16,458
  • 8
  • 40
  • 83