0

I have to track values for a combination of different resources everyday. So a table to do this might look like:

CREATE TABLE `data` (
  `id` INT UNSIGNED NULL PRIMARY KEY AUTO_INCREMENT,
  `datetime` DATETIME NOT NULL,
  `res1` INT UNSIGNED NOT NULL,
  `res2` INT UNSIGNED NOT NULL,
  `res3` INT UNSIGNED NOT NULL,
  `res4` INT UNSIGNED NOT NULL,
  `res5` INT UNSIGNED NOT NULL,
  `value` DECIMAL(10,0) NOT NULL,
  UNIQUE INDEX `datetime_res1_to_res5` (`datetime`, `res1`, `res2`, `res3`, `res4`, `res5`)
)

where res1 through to res5 are foreign keys to their respective tables.

This table will contains a lot of rows--will easily crack 20 million.

What I'm curious of is if I should put the combination of foreign keys into a separate table, such that I have two tables like so:

CREATE TABLE `data` (
  `id` INT UNSIGNED NULL PRIMARY KEY AUTO_INCREMENT,
  `datetime` DATETIME NOT NULL,
  `superKeys_id` INT UNSIGNED NOT NULL,
  `value` DECIMAL(10,0) NOT NULL,
  UNIQUE INDEX `datetime_superKeys_id` (`datetime`, `superKeys_id`)
)

CREATE TABLE `superKeys` (
  `id` INT UNSIGNED NULL PRIMARY KEY AUTO_INCREMENT,
  `res1` INT UNSIGNED NOT NULL,
  `res2` INT UNSIGNED NOT NULL,
  `res3` INT UNSIGNED NOT NULL,
  `res4` INT UNSIGNED NOT NULL,
  `res5` INT UNSIGNED NOT NULL,
  UNIQUE INDEX `res1_to_res5` (`res1`, `res2`, `res3`, `res4`, `res5`)
)

where data.superKeys_id is a foreign key to superKeys.id.

This will significantly reduce the size of the table. But I'm not sure if it might be a bad idea for reasons I do not know. Obviously selects will require a join to get a breakdown of data, which would add little more overhead, but I shouldn't think this would be a problem.

In my real world situation, one of the resources would be user_id and I would frequently need to sum the values for a user, so I would probably keep such a column in data rather than make it a part of the superKeys table for the sake of not having to join on every query. Then only use a join when I need to sum values for other resources which will be less often.

Courtney Miles
  • 3,756
  • 3
  • 29
  • 47
  • Maybe omit the surrogate key `id` alltogether ? If res1 ... res5 are a natural unique key another key would be superfluous. – wildplasser Feb 12 '13 at 00:36
  • Do you ever need to compare more or less than 5 resources? This seems like a highly denormalized schema. – Abe Miessler Feb 12 '13 at 00:38
  • Yes, vertical partitioning can be done to minimize the I/O for the "hot" half of the table. _However_, this makes sense only for large tables and 20 million doesn't really qualify as "large" on modern hardware, as @Catcall [so eloquently explained](http://stackoverflow.com/a/14823645/533120) (+1 to him). – Branko Dimitrijevic Feb 12 '13 at 13:34
  • 1
    BTW, your model lacks any (primary or alternate) keys. Unique _index_ is not the same thing as the unique _key_, and just declaring a field AUTO_INCREMENT is not the same thing as making a real primary key. Even if you are on MySQL, which tends to confuse these things, please use the proper terminology. – Branko Dimitrijevic Feb 12 '13 at 13:41
  • @Branko - Sincerely sorry for forgetting to indicate PRIMARY KEY. – Courtney Miles Feb 12 '13 at 21:22
  • @wildplasser - They are not naturally unique--combined with date they are unique in the data table. But the combinations or res1 to res5 will be duplicated many time in the data table, which is why I'm tempted to pull them out into their own table and have `data` only refer to the unique combinations of res1-res5. – Courtney Miles Feb 12 '13 at 21:27
  • @Abe Miessier - Can you explain? I had a mistake where superKeys had a value column, perhaps that's what made you think it was denormalised. – Courtney Miles Feb 12 '13 at 21:29
  • @Branko - What is the cost of vertical partitioning? – Courtney Miles Feb 12 '13 at 21:33
  • @user2045006 It's essentially the cost of JOIN (as you already noted), and some increase in storage space. The idea behind it is that you _won't have_ to do the JOIN in most situations, if the access pattern is just right. So, whether it's useful or not hangs on what kind of queries are most frequent. And even if the queries are "right", it's not very useful for tables that fit in RAM. – Branko Dimitrijevic Feb 12 '13 at 22:14
  • @Branko Thanks, that's the answer I was looking--one that explained the rationale between why you would and why you wouldn't. I just checked our worst DB, it has 56 mil rows, with a data length of 22 GB. The composite key actually has 8 resources, but I felt 5 was enough to illustrate my point. – Courtney Miles Feb 13 '13 at 00:39

1 Answers1

1

It won't reduce the size of the data. You'll have to store 20 million rows of data in the one table, and 20 million rows of superkeys in the other.

Five integers is 40 bytes. Multiply by 20 million--800 megabytes, plus a datetime column and a decimal. That whole table would fit in RAM on my netbook.

Keep the table "data". Drop the surrogate key.

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
  • There was a mistake in my example where I had put a value column on superKeys. With that said, there might be 10,000 unique combinations of res1 to res5 on day 1, and 10,000 on day 2, but for the 2 days together, there might be only 12,000 unique combinations--definitely not 20,000. So there willbe 20,000 rows in `data` but only 12,000 in `superKeys`. – Courtney Miles Feb 12 '13 at 21:16