7

I am designing a database which will be used for internationalized content. One of the features is a tagging system, similar to what Stack Overflow has. This is what I've got:

TABLE tags
tag_id | int(11) unsigned | NOT NULL   | PRI | auto_increment 

TABLE tag_translations
| tag_translation_id | int(11) unsigned | NOT NULL   | PRI | auto_increment 
| fk_language_id     | int(11) unsigned | NOT NULL   | MUL |                
| fk_tag_id          | int(11) unsigned | NOT NULL   | MUL |                 
| tag_name           | varchar(255)     | NOT NULL   | UNI |  

TABLE tag_relationships
| tag_relationship_id | int(11) unsigned | NOT NULL   | PRI | auto_increment
| fk_tag_id           | int(11) unsigned | NOT NULL   | MUL | 
| fk_solution_id      | int(11) unsigned | NOT NULL   | MUL |  

First of all, does it seem reasonable to have that tags table containing nothing but an ID? Second, how could I populate that column with the only field being an auto incrementing ID?

Calvin Froedge
  • 16,135
  • 16
  • 55
  • 61
  • I stumbled against this problem a couple of years ago, using SQL Server. I found no solution other than change the design of my tables. :-( – CesarGon Jun 26 '11 at 01:11
  • I can't think of a reason I'd ever have a table with nothing but an auto-increment ID. It feels completely wrong. I take it that the tags won't start with one primary language, which will then be internationalized later? – ETWW-Dave Jun 26 '11 at 01:13
  • An obvious solution I just thought of to the second problem (populating the column) is adding something that isn't arbitrary, but doesn't really change much, just for the sake of having something to insert - like a timestamp for when it was created. – Calvin Froedge Jun 26 '11 at 01:13
  • Haha, yes it does feel wrong, that's why I'm asking. However, I can't come up with a justification for why it IS wrong. Tags might exist as multilingual in some cases, but only an english version is present in other cases, in which case we will either do a machine translation or we will default to english. They could be inserted one language at a time, or 10 languages at once. – Calvin Froedge Jun 26 '11 at 01:15
  • Why not put the tag_name in the tags table and make the tag_name in your tag_translations table a tag_translation? – kinakuta Jun 26 '11 at 01:15
  • 1
    @kinakuta Then we're duplicating the english tag name....Think about what if the tag name gets modified later? Do we then update both the translation and the name or just the translation? I'd like to avoid duplication. What if (at some point) we have tag names in other languages but not in english? Then we've got multilingual tag names in our tags column. – Calvin Froedge Jun 26 '11 at 01:19
  • Then get rid of the tags table. Or rather get rid of what is currently the tags table and rename your translations table to just tags. – kinakuta Jun 26 '11 at 01:20
  • 1
    @kinakuta Hehe. Then the english and spanish translations for 'woodworking' don't have the same ID. Referential integrity is broken. See what I'm up against? – Calvin Froedge Jun 26 '11 at 01:22
  • This is *exactly* the same problem I had! – CesarGon Jun 26 '11 at 01:24
  • 1
    If a tag has some identity beyond the lexical designation, which is what you're insinuating here by suggesting that two words can refer to the same thing, then that is what has to go into your tags table. Whether that be the creator, the time created, or what, that is what you need to decide because you've now defined it as something other than a word. – kinakuta Jun 26 '11 at 01:27
  • @kinakuta Haha, so I suppose that the answer here is that - A. It is impossible to do an insert in MySQL with only one column which is an auto_incrementing ID, so I MUST define some other arbitrary reference to accompany the ID. Sound right? – Calvin Froedge Jun 26 '11 at 01:30
  • No - this works - INSERT INTO tbl_name () VALUES(); – Calvin Froedge Jun 26 '11 at 01:34
  • Well, I don't look at it as arbitrary - I look it as how are you choosing to define your data. In the case of a tag, I look at a tag as a word. If I want to search on a word, any search is going to involve searching using the letters of the word. In other words, the same "word" in another language is something else in my view. I don't see book and buch as the same tag. The notion of language is abstracted away in this model as words are a sequence of letters rather than a bundle of meaning. For the purpose of a tag system, this is all they need to be. – kinakuta Jun 26 '11 at 01:37
  • If I wanted to (for example) search for a particular tag, I'd probably be passing the user's language as a parameter and searching the tag_translations table, not the tag table. – Calvin Froedge Jun 26 '11 at 01:38
  • Btw, Kina, I appreciate the interest and help. If you want to provide an answer, I'll upvote it. – Calvin Froedge Jun 26 '11 at 01:40
  • 1
    What about when the same word is used for multiple languages? – kinakuta Jun 26 '11 at 01:41
  • Haha, had not considered the ramifications of that. I'll need to come up with a solution for it = ) – Calvin Froedge Jun 26 '11 at 01:46
  • 1
    This came up in a recent question [here](http://stackoverflow.com/questions/6431002/database-pk-fk-design-for-future-effective-date-entries/6431131#comment-7548258) where someone realized that they didn't actually know anything about their employees that might not change at some point -- gradually every field seemed like it had to be moved off to a history table that had `EffectiveDate`s (even names can change!). – Chris Cunningham Jun 26 '11 at 01:47
  • I understand the thinking behind your table design, but I don't think you actually gain anything by separating the tables in this particular case. Even if shoes and zapatos mean the same thing, you'll be doing an additional join for both your insertion and your search and you'll end up with the same number of records either way (you'll have a record for each translation) plus an additional table. – kinakuta Jun 26 '11 at 01:55

3 Answers3

2

As I say in my comments to the OP, I had exactly the same problem a few years ago. I was using SQL Server rather than MySql, though, but the problem was the same.

Unfortunately, the only solution I found was to add extra columns to the Tags table. I decided to add a DateCreated column which, eventually, turned out to be useful.

CesarGon
  • 15,099
  • 6
  • 57
  • 85
1

Q1: No. I can't see the point of the tags table. It seems not to mean anything. If you can't explain the semantic value of a row in that table (what does an entry in that table represent in the real world?), then it probably doesn't belong. I suspect it is only there in order to give you auto_increment, which would not be a good answer. You can assign your own IDs.

Q2: already answered.

I also don't see what tag_translation_id and tag_relationship_id are for. A habitual use of auto_increment?

I think what I'd do for basic structure is:

create table tag_translations (
  tag_id int not null,
  language_id int not null,
  tag_name varchar(255),
  primary key (tag_id, language_id)
);

create table tag_relations (
  tag_id int not null,
  solution_id int not null,
  primary key (tag_id, solution_id)
);

To which I'd add metadata and indexes as needed. Two column indexes are very nice for junctions such as tag_relations because of the server's 'Using index' optimization:

Using index

The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.

Btw, for internationalized systems, 255 isn't a magic number for varchar field length unless you're sticking to single-byte encodings. If you're using UTF-8, look at the manual and give it some thought, especially if you are going to index that column.

0
INSERT INTO tbl_name () VALUES(); 
Benny Hill
  • 6,191
  • 4
  • 39
  • 59
Calvin Froedge
  • 16,135
  • 16
  • 55
  • 61
  • 1
    `INSERT INTO tbl_name VALUES()` is enough. The server will attempt to set all columns to their respective default or automatic values. –  Jun 27 '11 at 18:24