I'm working on a POC to showcase how Cassandra works. I took Digg as an example. I wanted to create a data model that'll let me:

1. Add links
2. Add a link to a user's favorites list
3. Attach predetermined tags to links

I came up with two Column Families:

  1. Links

    • url is the key
      • id (a generated uuid)
      • user (who added it)
      • favCount (number of users who favorited the link)
      • upCount (number of users who liked it)
      • downCount (number of users who disliked it)
  2. UserFavs

    • user is the key
      • id (as many ids as the user has favorited)
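
To make the layout concrete, here is a minimal sketch that mocks up the two column families as plain Python dicts. It does not talk to Cassandra; `add_link` and `add_favorite` are hypothetical helper names, and the counts are ordinary integers rather than anything Cassandra-specific.

```python
import uuid

# In-memory stand-ins for the two column families (row key -> columns).
# Assumption: this only illustrates the layout, it is not a Cassandra client.
links = {}      # url -> {"id", "user", "favCount", "upCount", "downCount"}
user_favs = {}  # user -> {link id: ""}


def add_link(url, user):
    """Insert a row into Links keyed by url and return its generated id."""
    link_id = str(uuid.uuid4())
    links[url] = {
        "id": link_id,
        "user": user,
        "favCount": 0,
        "upCount": 0,
        "downCount": 0,
    }
    return link_id


def add_favorite(user, url):
    """Add the link's id as a column in the user's UserFavs row."""
    row = links[url]
    favs = user_favs.setdefault(user, {})
    if row["id"] not in favs:
        favs[row["id"]] = ""  # the column name carries the data, the value is empty
        row["favCount"] += 1


link_id = add_link("http://example.com/cassandra", "kumar")
add_favorite("alice", "http://example.com/cassandra")
```

Favoriting the same link twice leaves `favCount` unchanged, because the link id is already a column in that user's row.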

This works fine for requirements #1 and #2 above, but #3 is trickier. I could add tags such as 'java', 'languages', and 'architecture' as column names with empty values in the Links column family. But then a query such as "find all links tagged 'java'" would take a long time, since it would have to look at every row in Links.

Can anyone suggest how this could be implemented?

If the question isn't clear, please let me know.

Thanks, Kumar

KumarM

1 Answer

You could create a secondary index, i.e. a column family keyed on tag. Each row contains all the links for that particular tag. Note that this may result in very wide rows (i.e. rows with many columns), each of which will be stored on a single Cassandra node. You might want a scheme to split these rows up if they get very large.

See http://www.datastax.com/docs/0.7/data_model/cfs_as_indexes

or http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/

or search for "Cassandra secondary index".
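
As a rough illustration of the idea, the sketch below models the tag index as an in-memory dict keyed on tag, with each tag split across a fixed number of bucket rows so that no single row grows without bound. The bucket count, the `tag:bucket` row-key format, and the function names are all assumptions made for this example, not something prescribed by Cassandra or by the links above.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical value; 1 would give a single wide row per tag

# In-memory stand-in for a column family keyed on tag (not a Cassandra client).
tags_index = {}  # row key "tag:bucket" -> {url: ""}


def bucket_for(url):
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(url.encode("utf-8")) % NUM_BUCKETS


def tag_link(tag, url):
    """Add the url as an empty-valued column in the tag's bucket row."""
    row_key = f"{tag}:{bucket_for(url)}"
    tags_index.setdefault(row_key, {})[url] = ""


def links_for_tag(tag):
    """Read every bucket row for the tag and merge the column names."""
    urls = []
    for bucket in range(NUM_BUCKETS):
        urls.extend(tags_index.get(f"{tag}:{bucket}", {}))
    return sorted(urls)


tag_link("java", "http://example.com/a")
tag_link("java", "http://example.com/b")
tag_link("architecture", "http://example.com/a")
print(links_for_tag("java"))
```

Finding all links for a tag then costs `NUM_BUCKETS` row reads instead of a scan over every row in Links. The trade-off is that each new tagged link needs a write to the index as well as to Links.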

DNA
  • Thanks DNA. Is there a way to do it without another column family? A downside of inserting a link into two different column families is that, since Cassandra has no transactions across column families (that's how I remember it, but I could be wrong), the link might be inserted into the Links column family but not into, say, the Tags column family. Or did I misunderstand you completely? If so, please be more specific with your suggestion. Thanks – KumarM Dec 11 '11 at 21:23
  • You are correct about transactions - this is a limitation of the Cassandra design. Retries or undo can be used to cope with this situation (which would be very rare in practice, especially if both inserts are sent in the same message). – DNA Dec 11 '11 at 21:37
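
The retry-or-undo approach from the last comment could look roughly like the sketch below. It uses in-memory dicts in place of the two column families, and `WriteFailed`, `with_retries`, and `add_tagged_link` are hypothetical names invented for this example rather than part of any Cassandra client.

```python
# In-memory stand-ins for the Links and Tags column families.
links = {}       # url -> columns
tags_index = {}  # tag -> {url: ""}


class WriteFailed(Exception):
    """Hypothetical error standing in for a failed or timed-out write."""


def with_retries(write, attempts=3):
    """Call write() up to `attempts` times; return True once it succeeds."""
    for _ in range(attempts):
        try:
            write()
            return True
        except WriteFailed:
            continue
    return False


def add_tagged_link(url, columns, tag, index_write=None):
    """Write to Links, then to the tag index; undo the first write on failure."""
    links[url] = dict(columns)

    def default_index_write():
        tags_index.setdefault(tag, {})[url] = ""

    if with_retries(index_write or default_index_write):
        return True
    links.pop(url, None)  # undo: remove the row that has no index entry
    return False


def always_fails():
    raise WriteFailed("simulated timeout")


ok = add_tagged_link("http://example.com/a", {"user": "kumar"}, "java")
rolled_back = add_tagged_link(
    "http://example.com/b", {"user": "kumar"}, "java", index_write=always_fails
)
```

Here the second call exhausts its retries, so the Links row for `http://example.com/b` is removed again and the two column families stay consistent. In a real cluster the undo itself could also fail, so this narrows the window for inconsistency rather than closing it.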