How to handle a dimension table with duplicate ids containing slightly different values in data warehouse?

Question

I'm building out the data warehouse at my company and I've encountered a situation where I am pulling in data with slight variations in name but tied to the same ID. This is obviously a problem because my dimension table should only have one record per ID

for example:

+======+===================+
|  id  |      name         |
+======+===================+
|  185 | AAAA              |
+------+-------------------+
|  185 | AAAB              |
+------+-------------------+
|  197 | XXXA              |
+------+-------------------+
|  197 | XXXB              |
+------+-------------------+
|  197 | XXXC              |
+------+-------------------+

As you can see, the ID field should be tied to one unique value but there are strings that have slight variations but tied to the same ID. One thought was to normalize the strings but we would lose some of the metadata. Additionally, I should note that we are using Redshift which is why the unique id constraint is not being enforced. What would be the best solution to this issue?

How many such variations of name possible per unique_Id? do you need to query on name as well? — Red Boy, May 18 '18 at 05:27
does the name change over time? are you interested in the old names? is this a data quality issue? — Jon Scott, May 18 '18 at 06:58
I have at most 5 variations per ID and it is only an issue with around 10% of the total population of IDs. I don't foresee a situation where we would actually need to query on name. — 0xPeter, May 18 '18 at 16:46
The issue is this marketing platform we pull from. It is not an ideal platform to work with and this is just one of the many shortcomings of it. There are several affiliates running the same campaigns and they append some metadata to the overall campaign name which is where we get these variations, so this extra metadata isn't really too important but it would be nice to find a solution that allowed us to keep it. — 0xPeter, May 18 '18 at 16:48

demircioglu · Answer 1 · 2018-05-18T21:06:42.897

Keep the latest name in the dimension table and create a secondary table for "history" just in case you need the other names in the future. I had a similar situation with a user dimension and implemented the way I described it. You can choose a rule to decide which one is going to be in the dimension table

With your example, the two tables will look like this

dim table
+======+===================+
|  id  |      name         |
+======+===================+
|  185 | AAAB              |
+------+-------------------+
|  197 | XXXC              |
+------+-------------------+

dim_hist table

+======+========+=================+
|  id  | dim_id |    name         |
+======+======+===================+
|  101 | 185  | AAAA              |
+------+------+-------------------+
|  102 | 197  | XXXA              |
+------+------+-------------------+
|  103 | 197  | XXXB              |
+------+------+-------------------+

Using id from dim table you can join two tables and access other names

I figured out a satisfactory approach already, but this is a good extension to that approach. — 0xPeter, May 18 '18 at 21:06

score 0 · Answer 2 · answered May 18 '18 at 21:13

I don't know if this is the most optimal solution, but it is the solution we chose to accept for our situation. Essentially, I perform a self join on the dimension table and add a column which selects the shortest string per ID. Since typically, the variations between each value is due to data that is appended to the base string, the shortest string should return the base string which is the most important part of the field we are looking for.

Here is the sql code I wrote to perform this:

create table tmp_dim_offers as (
  -- create subquery
  with normalized_dim_offers as (
      select
        t1.id_offer,
        t1.dim_offer_name,
        min(t2.dim_offer_name) as normalized_offer_name
      from dim_offers as t1
        join dim_offers as t2 on t1.id_offer = t2.id_offer
      group by 1, 2
      order by t1.id_offer
  )
  -- select distinct ids and normalized offer name
  select distinct
    normalized_dim_offers.id_offer              as id_offer,
    normalized_dim_offers.normalized_offer_name as dim_offer_name
  from normalized_dim_offers
  order by normalized_dim_offers.id_offer
);

-- drop existing dim_offers table and replace with new normalized table
begin;
alter table dim_offers rename to dim_offers_to_delete;
alter table tmp_dim_offers rename to dim_offers;
drop table dim_offers_to_delete cascade;
commit;

one small suggestion. Always use column names, instead of ordinal names for ORDER BY, GROUP BY etc. — Venkataraman R, Oct 25 '19 at 04:05

score 0 · Answer 3 · answered Oct 25 '19 at 03:59

How about adding a new column in your dimension table and filling it with UUID value? The UUID column acts as a primary key.

In addition, that is the way I keep track of historical data. I think your problem is that someone had modified the records in the source table through time. By using UUID as pkey, we don't need to override the record, so we can keep versioning through time.

How to handle a dimension table with duplicate ids containing slightly different values in data warehouse?

3 Answers3