24

I have a photo website and i want to support tags as my original category bucketing is starting to fail (some pictures are family and vacations, or school and friends). Is there an agreed tagging db schema?

I still want to support having photos as part of an album.

Right now i have a few tables:

Photos

  • PhotoID
  • PhotoAlbumID
  • Caption
  • Date

Photo Album

  • AlbumID
  • AlbumName
  • AlbumDate
Jeff Atwood
  • 63,320
  • 48
  • 150
  • 153
leora
  • 188,729
  • 360
  • 878
  • 1,366

7 Answers7

31

There are various schemas which are effective, each with their own performance implications for the common queries you'll need as the number of tagged items grows:

Personally, I like having a tag table and a link table which associates tags with items, as it's denormalized (no duplication of tag names) and I can store additional information in the link table (such as when the item was tagged) when necessary.

You can also add some denormalised data if you're feeling frisky and want simple selects at the cost of the additional data maintenance required by storing usage counts in the tag table, or storing tag names which were used in the item table itself to avoid hitting the link table and tag table for each item, which is useful for displaying multiple items with all their tags and for simple tag versioning... if you're into that sort of thing ;)

hansaplast
  • 11,007
  • 2
  • 61
  • 75
Jonny Buchanan
  • 61,926
  • 17
  • 143
  • 150
15

I've done this in a small system without very many users, but I've wondered before if there was an "accepted" way to manage tags. After reading through the links posted by insin and plenty of other blog posts on tagging, it seems that the accepted way is to store it fully normalized and cache certain things if your data set gets too big.

Since it's a many-many relationship (each tag can belong to any number of photos - each photo can have many tags), relational database theory has you create a photo table, a tag table and a cross-reference table to link them.

photos
  photoid
  caption
  filename
  date

tags
  tagid
  tagname

phototags
  photoid
  tagid

This has scaling problems selecting from really large datasets, but so do all the less-normalized schemas (sorting and filtering by a text field will probably always be slower than using an integer, for example). If you grow as large as delicious or maybe even StackOverflow you'll probably have to do some caching of your tag sets.

Another issue you'll have to face is the issue of tag normalization. This doesn't have anything to do with database normalization - it's just making sure that (for example) the "StackOverflow", "stackoverflow" and "stack overflow" tags are the same. Lots of places disallow whitespace or automatically strip it out. Sometimes you'll see the same thing for punctuation - making "StackOverflow" the same as "Stack-Overflow". Auto-lowercasing is pretty standard. You'll even see special case normalization - like making "c#" the same as "csharp".

Happy tagging!

Neall
  • 26,428
  • 5
  • 49
  • 48
  • what is the best answer for tag normalization? – leora Oct 06 '08 at 21:10
  • 1
    I think it would depend a lot on what you're tagging. I prefer heavy normalization though - you really want to keep the tag count low. – Neall Oct 07 '08 at 11:45
3

A quick note on how to handle tags:

Tagging systems can vary from very rigidly defined tags, where creating new ones require explicit extra work (think gmail labels) to very loose systems where adding as many tags as possible is encouraged (think flickr, or tagging audio content where a transcription may be applied directly as tags).

In general, an easily-indexable media (text!) should have a more rigid system, since the content is already searchable. Additional tags exist more for categorization only, and categorization is only helpful when different users are broadly assigning things into the same categories. If you have raw text, it should take nearly an act of God to create a new tag.

On the other hand, media that's more difficult to index (images, video, audio) should have a flexible system that encourages many tags, since they and other metadata are your only hope when searching.

This is important because the database schema you want could change somewhat depending on which end of that spectrum you find yourself.

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
2

Something like this comes to my mind: add those two tables

Tags

  • TagID
  • TagName
  • TagDescription

PhotoTags

  • PhotoID
  • TagID

You can extend this to albums too, having an intersection table between Photo Albums and Tags.

friol
  • 6,996
  • 4
  • 44
  • 81
  • how do you avoid people not using the same name for a tag (like case sensitivities. etc) – leora Oct 05 '08 at 21:30
  • 1
    @akantro: give them a picklist of tags, to encourage them to pick a tag instead of typing in a new one. Or the modern equivalent, a text box with AJAX-driven tag completion. – Bill Karwin Oct 05 '08 at 21:42
  • is there a prebuilt control for AJAX driven tag completion ? – leora Oct 05 '08 at 22:09
2

I suggest looking to see how established open-source software does it. For example, Gallery stores its meta-data in a database like you do, and is pretty rich.

I don't think you'll find a "standard" schema, though. The closest thing I can think of is the EXIF meta-data format, which is embedded within image files themselves (by cameras, etc).

skaffman
  • 398,947
  • 96
  • 818
  • 769
2

if you want real performance with millions of records, you could store tags in one field, comma separated and retreive records with a full-text index/search daemon like sphinxsearch. All you have to add, is a table listing all tags with a count value to know how often they are attached to an item.

I know it's not the usual way and a little more complicated than a pure database solution, but it's really really fast to search tag related items.

You could use full-text search fonctionnality of your database engine too, but when there's lots of records, most engines tend to be slow.

If it's for a small project, you can go your way, seams good and proper way to do. But I would just share with you this other solution. What do you think of ?

0

In my app BugTracker.NET, I make an assumption that there won't be TOO many bugs. Maybe tens of thousands, but not tens of millions. That assumption allows me to cache the tags and the ids of the items they reference.

In the database, the tags are stored as they are entered, with the bugs, in a comma delimited text field.

When a tag field is added or changed, that kicks off a background thread that selects all bugids and their tags, parses the text, building a map where the key is the tag and the value is a list of all the ids that have that tag. I then cache that map in the Asp.Net Application object.

Below is the code I've just described.

The code could be optimized so that instead of going through all the bugs it just incrementally modified the cached map, but even unoptimized, it works fine.

When somebody does a search using a tag, I look up the value in the map, get the list of ids, and then fetch those bugs using SQL with "where id in (1, 2, 3...)" clause.

    public static void threadproc_tags(object obj)
    {
        System.Web.HttpApplicationState app = (System.Web.HttpApplicationState)obj;

        SortedDictionary<string,List<int>> tags = new SortedDictionary<string,List<int>>();

        // update the cache
        DbUtil dbutil = new DbUtil();
        DataSet ds = dbutil.get_dataset("select bg_id, bg_tags from bugs where isnull(bg_tags,'') <> ''");

        foreach (DataRow dr in ds.Tables[0].Rows)
        {
            string[] labels = btnet.Util.split_string_using_commas((string) dr[1]);

            // for each tag label, build a list of bugids that have that label
            for (int i = 0; i < labels.Length; i++)
            {

                string label = normalize_tag(labels[i]);

                if (label != "")
                {
                    if (!tags.ContainsKey(label))
                    {
                        tags[label] = new List<int>();
                    }

                    tags[label].Add((int)dr[0]);
                }
            }
        }

        app["tags"] = tags;

    }
Corey Trager
  • 22,649
  • 18
  • 83
  • 121
  • Normalizing would be slower for reads, inserts, updates. and deletes. Why *WOULD* I normalize? More importantly, regardless of how the tag data is stored physically, when I actually use it for the searches, I'm using what's cached in memory. The db schema itself is actually irrelevant at search time. – Corey Trager Sep 10 '09 at 02:18