
I am considering the following schema for a MySQL 5.6 InnoDB table. I'm having trouble deciding which keys to use:

    "CREATE TABLE IF NOT EXISTS `context` ("
    "  `text_id` INT NOT NULL AUTO_INCREMENT,"
    "  `doc_id` INT NOT NULL,"
    "  `text` VARCHAR(255),"
    "  PRIMARY KEY (`text_id`),"
    "  UNIQUE KEY `text_uk` (`text`),"                  <<< OPTION 1
    "  UNIQUE KEY `docidtext_uk` (`doc_id`, `text`),"   <<< OPTION 2
    ") ENGINE=InnoDB "

I can either require the `text` column to be unique (OPTION 1), or allow a few duplicate entries in `text` and instead place a compound unique key on `doc_id` + `text` (OPTION 2). Option 1 is clearly more efficient from a storage perspective, but query speed is by far the more important concern for us.

Given that `text` is very long and therefore makes an inefficient unique key, I suspect that using the compound key instead may result in faster queries (note that `doc_id` is an integer and therefore comparatively cheap to index). However, I am not familiar enough with MySQL internals to know whether compound keys work this way.

If read speed is our primary concern, which of these options would be the best practice? Is there any harm or benefit to keeping both keys?
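For concreteness, the only query this table ever serves is a point lookup of the following shape (the literal values are placeholders):

    SELECT `text_id`
    FROM `context`
    WHERE `doc_id` = 42              -- always known to the caller
      AND `text` = 'some snippet';   -- likewise always known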

Notes:

  • The only function that queries this table always has access to the values of doc_id and text (the sole purpose of this function is to look up text_id).
  • The table does not currently have data but we expect it will ultimately hold roughly 1 billion rows.
  • Each `doc_id` value is duplicated in up to 99 other rows of the `doc_id` column (i.e., up to 100 rows per document).
  • There are likely to be many millions of cells in the text column for which the first 20+ characters are identical (all of the text snippets are short and pertain to the same general topic).
  • The goal is deduplication of the text column. For this particular application doing so is expected to save a significant amount of space.

Comments:

  • Indexes use B-trees. Efficiency is dependent on how often you have rows with a long, common prefix of the value. – Barmar Apr 29 '15 at 19:53
  • You also need to consider the application. What if two people write documents that just happen to have the same `text`? Making it a unique column will not allow that. BTW, 255 characters is not really very long. – Barmar Apr 29 '15 at 19:57
  • Barmar, thanks for your reply. If we placed the unique key on the `text` column alone, the two documents (and perhaps many more) would share the same `text_id`. The `doc_id` column would be omitted from the table in this case. – Matthew Blum Apr 29 '15 at 20:07
  • A 255-char index would not normally concern me but this is a big table - a good billion rows - and there are only a few duplicate `doc_id`s. – Matthew Blum Apr 29 '15 at 20:10
  • As I said above, the full size of the column is not an issue. What matters is how many values will have long prefixes in common. If they can usually be distinguished in the first 10-20 characters, it's effectively the same as an index on a 20-character column. – Barmar Apr 29 '15 at 20:13
  • Have you studied how B-trees work? That should make this clearer. – Barmar Apr 29 '15 at 20:14
  • Why are you even "saving" space in the first place? You're having space issues? Also, your entire premise is silly, you're making a unique index out of text instead of hashing your text and making unique index out of that. – N.B. Apr 30 '15 at 12:47

3 Answers


Have `text_id` as the primary key.
Add an index on `doc_id` and `text`, including `text_id`, when you start to see performance problems. Be careful of premature optimization.
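
A minimal sketch of that deferred index (the index name is invented; MySQL has no INCLUDE clause like SQL Server's, but an InnoDB secondary index implicitly carries the primary-key columns, so `text_id` comes along for free):

ALTER TABLE `context`
    ADD INDEX `docid_text_ix` (`doc_id`, `text`);

-- The lookup is then answered entirely from the index (it is covering):
SELECT `text_id` FROM `context`
WHERE `doc_id` = 42 AND `text` = 'some snippet';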

Zohar Peled

What is the goal?

  • Dedup text to save space? Seems unlikely to save enough to matter.
  • 'Normalize' text so that if you need to change a string, you can change it in one place? No hint of that.
  • Discover which doc contains a given text? Apparently not needed.
  • Save space in several other tables that would otherwise have text in them? Apparently there is only one other table.

Conclusion: you don't seem to need anything that justifies this table, so

KISS.

Just put text in the table that needs it. If I'm wrong on one of the 4 bullet points, I'll change my answer.

Edit

If there will be a lot of duplicate strings, and the goal is deduping, I recommend:

CREATE TABLE context (
    text_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    text VARCHAR(255) NOT NULL,
    PRIMARY KEY(text_id),
    UNIQUE(text)            -- for finding dup
) ENGINE=InnoDB;            -- because of the way it clusters

INSERT INTO context (text) VALUES ('$escaped_text')
    ON DUPLICATE KEY UPDATE text_id = LAST_INSERT_ID(text_id);
    -- On a duplicate, the otherwise no-op UPDATE feeds the existing
    -- row's text_id into LAST_INSERT_ID() (see the ref manual).
$text_id = LAST_INSERT_ID();   -- id of the new or the pre-existing row

If the strings tend to be long, use VARBINARY instead, and compress/uncompress in the client. For typical text, compression will shrink text by a factor of 3.
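
As a rough sketch of that variant, here is the same table with MySQL's built-in COMPRESS()/UNCOMPRESS() standing in for the client-side compressor (the table and column names are invented; compressing in the client, as suggested above, spares the server the CPU work):

CREATE TABLE context_z (
    text_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ztext VARBINARY(300) NOT NULL,  -- COMPRESS() output for a 255-char string
    PRIMARY KEY(text_id),
    UNIQUE(ztext)  -- dedup still works if the compressor settings never change
) ENGINE=InnoDB;   -- much longer texts would need a bigger column and a hash key

INSERT INTO context_z (ztext) VALUES (COMPRESS('$escaped_text'))
    ON DUPLICATE KEY UPDATE text_id = LAST_INSERT_ID(text_id);
SELECT UNCOMPRESS(ztext) FROM context_z WHERE text_id = LAST_INSERT_ID();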

Rick James

From what I read, you need to connect documents to text. This text, for some funny reason, can be duplicated. Personally, I would never look for duplicated text and optimize for that, but people are funny, so here goes:

  • Create a table that holds text only. Hash the text (so you index less data) and make a unique index out of the hash; see the sketch after this list. Your text can now be arbitrarily long.

  • Insert into the table that links documents and texts, linking them via foreign keys.
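
A minimal sketch of that layout (all table, column, and key names here are invented for illustration; SHA2() and UNHEX() are standard MySQL functions):

-- Text stored once, keyed by a SHA-256 hash of its content:
CREATE TABLE text_store (
    text_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    text_hash BINARY(32) NOT NULL,   -- UNHEX(SHA2(text, 256))
    text      TEXT NOT NULL,         -- may now be arbitrarily long
    PRIMARY KEY (text_id),
    UNIQUE KEY (text_hash)
) ENGINE=InnoDB;

-- Link table connecting documents to their deduplicated texts:
CREATE TABLE doc_text (
    doc_id  INT UNSIGNED NOT NULL,
    text_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (doc_id, text_id),
    FOREIGN KEY (text_id) REFERENCES text_store (text_id)
) ENGINE=InnoDB;

-- Insert-or-fetch, reusing the LAST_INSERT_ID() trick from the answer above:
INSERT INTO text_store (text_hash, text)
    VALUES (UNHEX(SHA2('$escaped_text', 256)), '$escaped_text')
    ON DUPLICATE KEY UPDATE text_id = LAST_INSERT_ID(text_id);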

Congratulations, you've micro-optimized something. Now, if you decide to really, really squeeze out all the space you can and super-optimize things, then you might turn to this storage engine for MySQL, and you'll get up to a whopping 12x space reduction compared to InnoDB.

N.B.