What is the Cassandra database schema used in Reddit?

Question

Reddit is currently migrating its database from PosgreSQL to Apache Cassandra. Does anybody know what database schema does Reddit use in Cassandra?

I don't, and I'm not sure anyone outside of Reddit does, but should it really matter? I expect quite a few people here on SO could help determine the schema that's right for *your* application. — the paul, May 03 '12 at 18:59
Reddit published the code that powers the site on GitHub: https://github.com/reddit/reddit . I could search in the code and determine the schema from there. But I thought that is easier to ask here. — Calin-Andrei Burloiu, May 05 '12 at 17:24
I glanced through the code, and I see something like two dozen different columnfamilies that get created and used in different ways. Is there an area in particular you're wondering about, or are you looking more for something like `show schema` output from cassandra-cli? — the paul, May 05 '12 at 21:45
I want to design an application that uses trees of comments as in Reddit. So, since my application is similar, I am trying to get some inspiration from there. — Calin-Andrei Burloiu, May 17 '12 at 09:42

score -2 · Answer 1 · answered Jan 02 '14 at 06:03

I also don't know the exact Reddit schema, but for what you want to archive, you are on the right way, saving a hierarchy of comments in a document based database instead of a relational database. I would recommend to keep one document for each root-comment, and then add all the children (and children of the children) to that comment.

In CouchDB and MongoDB you can store JSON documents directly. In Cassandra I would save the JSON as a String. So the data structure would be only

root-comments
{
    root-comment-id
    root-comment-json-string
}

and each root-comment-json-string would look like this:

{
comment : "hello world"
answers : 
[
    {
    comment : "reply to hello world"
    answers : 
    [
        {
        comment : "thanks for the good reply"
        answers : []
        },
        {
        comment : "yes that reply was indeed awesome"
        answers : []
        }

    ]
    }

]
}

additionally you might want to add a UserName, UserID, Timestamp, .... etc. to the structure of each comment.

This 'denormalized' structure will make make the queries very fast compared to a normalized relational stucture IF you have A LOT of data.

In any case you will have to take care of all the exceptions, that can happen when you implement such a system for a large user scale, eg. What happens if someone replies to comment A with comment B, but at the same time (or later) comment A is deleted.

If you search the internet for "cassandra hierarchical data" you find some other approaches, but they all go back to normalisation or they are not complete for a 'infinite' hierarchy.

The problem with the approach you describe is that any time a new comment is added you have to update the JSON, that is, parse it, merge the comment in it and then write it to Cassandra. Imagine a tree with thousands of comments. So this approach is cheaper for retrieval, but expensive when updating. The relational approach is the other way around, expensive when retrieving a comment tree and cheap when updating. I think the best approach is a hybrid one, store the most import comments as you describe and the least important in a relational manner. — Calin-Andrei Burloiu, Jan 02 '14 at 11:45
@Calin-AndreiBurloiu Yes true. It is my understanding that such a comment system (like in reddit) has far more reads than updates. So my answer is exactly the correct solution. — Kenyakorn Ketsombut, Jan 06 '14 at 03:01
And the parsin of the result can easily done in JavaScript on the client, since it is JSON. — Kenyakorn Ketsombut, Jan 06 '14 at 03:03
Kind of a newbie to this and your post helped to explain the structure a lot. In this case, how would inserting a comment work? I imagine there's a method addComment(parentId,commentString) - how do we find that particular node to insert a new child? — Nathan, May 20 '14 at 07:31

What is the Cassandra database schema used in Reddit?

1 Answers1