I am considering the following schema for a MySQL 5.6 InnoDB table. I'm having trouble deciding which keys to use:
CREATE TABLE IF NOT EXISTS `context` (
  `text_id` INT NOT NULL AUTO_INCREMENT,
  `doc_id` INT NOT NULL,
  `text` VARCHAR(255),
  PRIMARY KEY (`text_id`),
  UNIQUE KEY `text_uk` (`text`),                -- OPTION 1
  UNIQUE KEY `docidtext_uk` (`doc_id`, `text`)  -- OPTION 2
) ENGINE=InnoDB;
I can either require the column `text` to be unique (OPTION 1), or alternatively I can allow for a few duplicate entries in `text` and instead place a compound unique key on `doc_id` + `text` (OPTION 2). Option 1 is clearly more efficient from a storage perspective, but query speed is by far the more important concern for us.
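To put rough numbers on the storage side of this trade-off, one approach is to create the table with both keys, load a representative sample, and read the per-index sizes from InnoDB's persistent statistics. This is only a sketch: it assumes persistent statistics are enabled (the default in 5.6.6 and later), that the account can read the `mysql` schema, and that `context` lives in the currently selected database.

```sql
-- Refresh the persistent statistics so the sizes reflect the loaded sample.
ANALYZE TABLE `context`;

-- stat_name = 'size' reports the number of pages in each index;
-- multiplying by the page size gives an approximate size in bytes.
SELECT index_name,
       stat_value                      AS pages,
       stat_value * @@innodb_page_size AS approx_bytes
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()
  AND table_name    = 'context'
  AND stat_name     = 'size';
```

The `PRIMARY` row covers the clustered index (the table data itself); the `text_uk` and `docidtext_uk` rows are the two candidates being compared.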
Given that `text` is very long and therefore makes an inefficient unique key, I suspect that using the compound key instead may result in faster queries (note that `doc_id` is an integer and therefore relatively very efficient). I am not familiar enough with MySQL internals to understand if compound keys work this way, however.
If read speed is our primary concern, which of these options would be the best practice? Is there any harm or benefit to keeping both keys?
Notes:
- The only function that queries this table always has access to the values of `doc_id` and `text` (the sole purpose of this function is to look up `text_id`).
- The table does not currently have data but we expect it will ultimately hold roughly 1 billion rows.
- Each `doc_id` has up to 99 duplicates in other rows of the `doc_id` column.
- There are likely to be many millions of cells in the `text` column for which the first 20+ characters are identical (all of the text snippets are short and pertain to the same general topic).
- The goal is deduplication of the `text` column. For this particular application doing so is expected to save a significant amount of space.
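For reference, the lookup described in the first note would look roughly like the statements below. The literal values `42` and `'example snippet'` are placeholders, and the `FORCE INDEX` hints are there only to compare the plans for the two candidate keys on a test table that has both defined; the real function would not need them.

```sql
-- The lookup as it would run under OPTION 2 (both values are always known).
SELECT `text_id`
FROM `context`
WHERE `doc_id` = 42
  AND `text`   = 'example snippet';

-- Plan when restricted to the single-column key (OPTION 1).
EXPLAIN
SELECT `text_id`
FROM `context` FORCE INDEX (`text_uk`)
WHERE `text` = 'example snippet';

-- Plan when restricted to the compound key (OPTION 2).
EXPLAIN
SELECT `text_id`
FROM `context` FORCE INDEX (`docidtext_uk`)
WHERE `doc_id` = 42
  AND `text`   = 'example snippet';
```

In the `EXPLAIN` output, the `key` column shows which index was used and `key_len` shows how many bytes of the key took part in the lookup, which gives a concrete basis for comparing the two options.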