tagged dimensions in data warehouse

Question

In my data warehouse I got a dimension where each record can have one or several tags. Are there any straight forward ways to model a dimension with different tags?

My initial idea was to store the comma separated list, and then use MySQL:s FIND_IN_SET() function to test for the presence of a tag. This allows me to slice the data.

But how should I model my scheme if I want to be able to group by tag?

Example:

Two products: Product A and product B. A is tagged with "foo", "bar". B is tagged with "bar", "baz".

Query: sales amounts, grouped by product tag. Group for tag "bar" must include sales for both products A and B:

foo -> sales for A
bar -> sales for B and A
baz -> sales for C

score 1 · Answer 1 · answered Apr 13 '11 at 19:56

For example you could store your data duplicated by tag.

If you have sales a=10 foo + bar and b=20 bar + baz you could store it like this.

    sales_by_tag_facts
    id, tag, sale_id, amount, is_primary_record
    1   foo  A        10     true
    2   bar  A        10     false
    3   bar  B        20     true
    4   baz  B        20     false

select sum(amount) from sales_by_tag_facts group by tag; // by tag
select sum(amount) from sales_by_tag_facts where is_primary_record=true; // without tag.

score 0 · Answer 2 · answered Apr 14 '11 at 08:03

Why not shift the heavy lifting from reporting to the transaction booking

You can add a new table called:

TagTotal that stores the total amount per tag and updates with with every transaction via a BEFORE(AFTER)_UPDATE_EACH trigger.

Extra fields/table

If you have 2 extra fields in the product table:

product.amount    decimal(10,2) running total of sales to date
product.last_sale date          date of the last sale

And the tag totals table looks like this.

tag.id primary autoinc    
tag.tagstr varchar(25)     
tag.amount decimal(10,2)
tag.date_from date   #dates to keep the running totals per month/week/day.
tag.date_to   date

pseudo code

CREATE TRIGGER ai_sales_each AFTER INSERT ON sales FOR EACH ROW
BEGIN
  UPDATE product SET product.amount = product.amount + new.amount,
    product.last_sale = sale.date
END

CREATE TRIGGER au_product_each AFTER UPDATE ON product FOR EACH ROW
BEGIN
  DECLARE AllTags VARCHAR(255);
  DECLARE ThisTag VARCHAR(25);

  IF old.tags <> new.tags THEN BEGIN
    reorganize the running totals in the tagtotal table.
  END; END IF;

  SET AllTags = new.tags;
  WHILE AllTags has more tags BEGIN
    SET ThisTag = NextTag(AllTags);
    UPDATE TagTotals SET amount = amount + new.amount
      WHERE TagTotals.tagstr = ThisTag 
      AND new.last_date BETWEEN TagTotals.date_from AND TagTotals.date_to;
  END; END WHILE;

Now if you want the sales totals per tag, you can just select from the tagtotals table.
Which will give instant results.

score -1 · Answer 3 · edited May 23 '17 at 12:16

-1

I would recommend against that, it goes against the normalization rules.
See I keep messing up 1NF
Or read up on the posts under the normalization tag.

Redesign suggestion for tables

If you make a tag and taglink table like so.

table tag
  id autoincrement integer primary index 
  tag_str varchar(20) index unique not null

table taglink
  id autoincrement integer primary index #gotta have an ID in my book.
  tag_id integer not null
  product_id integer not null

and you have a sales table something like this.

table product
  id autoincement integer primary index
  desc varchar(255)
  barcode, price, whatever ...

select statement to find products per tag

Than you can lookup articles that match a tag as follows.

select * from product
inner join taglink on (product.id = taglink.product_id)
inner join tag on (taglink.tag_id = tag.id)
where tag.tag_str in ('foo','bar','baz');

select statement to list tags per product

select tag_str from tag
inner join taglink on (taglink.tag_id = tag.id)
inner join product on (taglink.product_id = product.id)
where product.barcode = '4548215' or product.desc like 'OMG Po%'

Adding new tags

To add a new tag, just

insert into tag (id, tag_str) values (
  null /*remember autoincrement*/
  ,'Mytag');

linking a tag

To link a tag to a product

set @product_id = 10;
set @tag_id = 1;
...or...
select @product_id:= product.id from product where product.barcode = '1254851';
...
insert into taglink (id, product_id, tag_id) values (
  null /*autoinc id*/
  ,@product_id
  ,@tag_id );

You can link an unlimited number of tags to a product and you don't slow your queries down with costly FIND_IN_SET statements.
And you prevent duplicate tags.
And your database will be faster and smaller.

edited May 23 '17 at 12:16

Community

1
1

answered Apr 13 '11 at 20:20

Johan

74,508
24
191
319

How will you join 3 tables with millions of rows in each? Following the rules is not a goal. Denormalization is the usual approach in data warehousing. – Andrey Frolov Apr 13 '11 at 20:43
If you make the id fields a hash-type index, matching will occur in O(1) time. And all matches are equi-joins, so this will be faster than you think. – Johan Apr 13 '11 at 20:56
I'd suggest setting up a test-rig and timing this. Please please let us know the outcome, I'm really curious what the verdict is. – Johan Apr 13 '11 at 20:57
The tag table is small(ish), the other 2 can be very big. I have a sneaking suspicion the order of the inner joins is critical speed wise. – Johan Apr 13 '11 at 21:01
Doesn't matter how fast matching will be. Mysql is not able to do merge join or hash join. It will have to perfom nested loops. It will take ~10mln*20mln*1000*O(1) time. – Andrey Frolov Apr 13 '11 at 21:01
Are you sure? I do believe version 5.5 does hash joins. – Johan Apr 13 '11 at 21:02
Can't find any information abount hash join in mysql docs. By the way.. I forgot about random access during join. FYI: http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/ – Andrey Frolov Apr 13 '11 at 21:07
Are you sure? I do believe version 5.5 does hash joins. You are describing a full table scan. The index will prevent a full table scan. That's the whole point of an index, at most it will do sqrt(10 mio) = 3162 * Sqrt(20 mio) = 4472 * sqrt(1000) = 31. And that's assuming MySQL uses binary trees which it does not, it is smarter than that, so no time the code and you will see. – Johan Apr 13 '11 at 21:09
You will have to do full table scan. Look at our target query - "sales amounts, grouped by product tag". You have to read all the data to do it, so you will prefer full scan instead of index scan. Just check this if you are not belive me :) Sorry, I am going to sleep. It is 1 AM here... We can continue tomorrow :) – Andrey Frolov Apr 13 '11 at 21:16
MySQL NDB cluster supports hash joins since 5.0.36. – Johan Apr 13 '11 at 21:21
Somehow missed your query subtleties, but if you store the sales totals in the product table, than the tag lookup should really only take constant time with hash joining. (provided you use NDB) {or MEMORY, but I'd like to see **that** RAM bank :-)} – Johan Apr 13 '11 at 21:24
To dadd to the discussion: the product dimension is a SCD with approx 2M rows right now. Each product has on average 2-3 tags. – jrydberg Apr 15 '11 at 06:11
What you're proposing looks very similar to this: http://forge.mysql.com/wiki/TagSchema#Recommended_Architecture – jrydberg Apr 15 '11 at 06:12

tagged dimensions in data warehouse

3 Answers3