1

This is somewhat of a meta question, but because it relates to a database design, I thought I should post it here.

I'm building a site that includes Q+A and was wondering how I should structure my SQL database, so naturally, I looked to the best of the best. However, the Stack Exchange database schema seems to defy what I've learned about creating maintainable/extensible table hierarchies.

Stack Exchange's database schema for posts, as shown in the data explorer

As you can see, Stack Exchange stores all of its "Posts" in one table, except for comments, which has its own table. Post types include questions, answers, and various wiki things. This results in a lot of NULL columns in the table. For example, questions have titles, tags, and answerCounts, while answers don't, so all answer entries have NULL for all three of those columns. If more post types are added over time, this will progressively become less maintainable. And the fact that comments is the only type of post that has its own table just seems inconsistent.

What I've read states that it's generally preferred to use an object subclass hierarchy, in which there's a generic "Posts" table along with a bunch of tables for each type of post that all have one column that maps back to the corresponding entry in the "Posts" table. This keeps the number of null columns to a minimum and makes it more extensible, but slows down queries because they'll require more joins.

So why does Stack Exchange use this giant table method? Is it just the result of ages of modifications to an old database? More specifically, should I use this model for my own Q+A system or stick with an object subclass hierarchy (my Q+A/forum system will closely resemble SO's, with several types of posts including questions, answers, polls, reviews, etc.)?

Community
  • 1
  • 1
John Qian
  • 123
  • 9
  • How is it less maintainable? Also, you've been reading the wrong things. Read Martin Fowler. He suggests Single Table Inheritance as the go-to strategy. – Neil McGuigan Aug 08 '14 at 18:43

1 Answers1

4

This is a classic case of so-called "Object-relational impedance mismatch". Specifically, you are taking about mapping OO's inheritance into a relational database structure. There are several common ways of doing that -

  • A table per subclass,
  • A table per leaf subclass, and
  • A table per class hierarchy (with a discriminator)

Each of these strategies is perfectly valid. Moreover, the structures could be mixed as needed.

It looks like Stack Exchange used a table per class hierarchy approach, with PostTypeId serving as a discriminator. This approach is as valid as any other approach that they could have taken. It is also one of the simplest ones to take from the maintenance standpoint, because it lets you construct manual queries with less work.

There is another thing in the structure of the table that you did not mention: it is not normalized. Specifically, there are AnswerCount and CommentCount fields that store information that could be obtained by aggregating the table (i.e. running a SELECT COUNT(*) FROM ... WHERE ... AND other.ParentId = p.Id ...) This is a common tradeoff between normalization and speed of execution: most likely, the profiling has indicated that the aggregation takes significant amount of time, so the counts have been moved into the "parent" record.

Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
  • I see, so I guess it's perfectly fine to have a large amount of null columns? Also, I don't understand what you mean by "It looks like Stack Exchange used a table per class hierarchy approach", as it looks to me like they only have one table for all the classes. I did notice the lack of normalization and I find it completely logical, but I don't see how it's relevant. – John Qian Aug 08 '14 at 16:18
  • @JohnQian Posts table stores the data for all objects of a single class hierarchy, consisting of questions and answer subclasses, and a base class for AbstractPost. Comment table holds data for a separate class hierarchy consisting of a single class `Comment`. There is a separate hierarchy of users as well. I mentioned lack of normalization as а potential departure from best practices. – Sergey Kalinichenko Aug 08 '14 at 16:26