5

Version Dependent

Some of the answers to this question deal with older versions of Cassandra. The correct answer for this kind of problem depends on the version of Cassandra you are using.


I have a profile column family and want to store a list of skills in each profile. I'm not sure how this is typically accomplished in Cassandra. One option would be to store a serialized Thrift or protobuf object, but I'd prefer not to do this: I believe Cassandra has no knowledge of these formats, so the data in the datastore would not be human readable or queryable via CQL from the command line. The other solution I thought of would be to use a super column and put the skill as the key with a null value:

skills: {
  'java': '',
  'c++': '',
  'cobol': ''
}

Is this a good way of handling lists in Cassandra? I imagine there's some idiom I'm not aware of. I'm using the Astyanax client library, which supports only composite columns, not super columns, so the solution I proposed above would seem quite awkward in that case. I'm also still having some trouble understanding composite columns, as they don't seem to be completely documented yet. Would this solution work with composite columns?

Raedwald
Ben McCann

3 Answers

4

This answer dates to before the release of Cassandra version 1.2, which provided substantially different functionality for handling lists. The answer might be inappropriate if you are using Cassandra 1.2+.


I would encode lists in the column key, using composite columns with the real column name as the first dimension, i.e.:

row_key -> {
     [column_name; entry1] -> "",
     [column_name; entry2] -> "",
     ... 
}

Then, to read the list, you would need to do a get_slice from [column_name; ] to [column_name; ] - note the empty dimensions.

The great thing about this is that it actually implements a set quite nicely: the list cannot contain the same thing twice. I think this works in your use case. The list would also be maintained in sorted order.
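
To make the set-like, sorted behaviour concrete, here is a minimal in-memory sketch in Python. It only models a row whose column names are (column_name, entry) pairs kept in sorted order; it is not the Astyanax or Thrift API, and the `Row` class with its `put` and `slice` methods is a hypothetical name chosen for illustration.

```python
from bisect import bisect_left, insort


class Row:
    """Hypothetical in-memory stand-in for one row with composite column names."""

    def __init__(self):
        self._names = []   # sorted list of (column_name, entry) tuples
        self._values = {}  # (column_name, entry) -> column value

    def put(self, name, entry, value=""):
        key = (name, entry)
        if key not in self._values:
            insort(self._names, key)  # keep column names in sorted order
        self._values[key] = value     # writing the same name twice just overwrites

    def slice(self, name):
        """Return every entry stored under `name`, like a slice over that prefix."""
        start = bisect_left(self._names, (name, ""))
        entries = []
        for column_name, entry in self._names[start:]:
            if column_name != name:
                break
            entries.append(entry)
        return entries


row = Row()
row.put("firstName", "", "Ben")  # ordinary attributes share the same row
for skill in ["java", "c++", "cobol", "java"]:
    row.put("skills", skill)

print(row.slice("skills"))  # ['c++', 'cobol', 'java']
```

The repeated 'java' write collapses into a single column, and the entries come back sorted, which are the two properties described above.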

Raedwald
tom.wilkie
  • Thanks. To clarify, column_name would be 'skills' and entry1, entry2, etc. would be 'java', 'c++', etc. And then would row_key be the user id and the other user attributes would go where the ellipses are? Or can you have a composite super column with 'skills' being the column name in the user column family and the value being the composite column you've shown? – Ben McCann Mar 26 '12 at 17:39
  • 1
    That's correct. I would avoid mixing composite and super columns, to keep it simple. – tom.wilkie Mar 27 '12 at 08:07
  • After a lot of reading it would seem that I can't have regular columns and super columns in the same column family, so I'm not sure how your solution would work. Where would I store the user's other attributes like 'firstName', 'lastName', etc.? – Ben McCann Mar 28 '12 at 04:33
  • This solution does not involve super columns, just composite columns. – MeBigFatGuy Apr 03 '12 at 03:07
  • This approach would not allow a secondary index on this field. In order to search by this field another index column family (or something to that effect) would have to be stored. – Ophir Radnitz Aug 06 '12 at 11:48
3

In older versions of Cassandra, you had to serialize the list yourself and store it in a column, or perhaps use a super column.

Since version 1.2 of Cassandra, CQL3 has collection types for columns, so you can give list<text> as the type of a column in your schema. For example:

 CREATE TABLE Person (
    name text,
    skills list<text>,
    PRIMARY KEY (name)
 );

Or you could use set<text> if you want to automatically eliminate duplicates.

Daoyu Tu
Raedwald
3

This answer dates to before the release of Cassandra version 1.2, which provided substantially different functionality for handling lists. The answer might be inappropriate if you are using Cassandra 1.2+.


As mentioned on the mailing list, my preference, which has worked very well for me, is to store a single column "skills" with the value being a serialized JSON string.

It really comes down to the usage patterns you have for "skills".

  • If "skills" are just for CRUD on a per user basis, this is fine.
  • If you want to be able to search for all users that have a skill of "cobol", then I would still recommend this approach, plus another row keyed skill:cobol that has one column per user UUID with a timestamp (or something similar) as the value ...
  • I'm sure that with Pig/Hadoop integration on your Cassandra nodes, you could also still quite happily query all of the users that have x, y and z to generate new data to support additional use cases.
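
The first two bullets can be sketched roughly as follows. This is an in-memory Python illustration under stated assumptions, not client code: the `users` and `skill_index` dicts stand in for two column families, and `save_skills` and `users_with` are hypothetical helper names.

```python
import json
import time

users = {}        # row key (user id) -> {column name: value}
skill_index = {}  # row key "skill:<name>" -> {user id: timestamp}


def save_skills(user_id, skills, now=None):
    """Store the list as one JSON string and keep the skill:<name> rows in step."""
    now = time.time() if now is None else now
    row = users.setdefault(user_id, {})
    # Read the old value first so stale index entries can be removed.
    old = set(json.loads(row.get("skills", "[]")))
    new = list(skills)
    row["skills"] = json.dumps(new)
    for skill in old - set(new):
        skill_index.get("skill:" + skill, {}).pop(user_id, None)
    for skill in new:
        skill_index.setdefault("skill:" + skill, {})[user_id] = now


def users_with(skill):
    """Return the ids of all users indexed under the given skill."""
    return sorted(skill_index.get("skill:" + skill, {}))


save_skills("u1", ["java", "cobol"])
save_skills("u2", ["cobol"])
print(users_with("cobol"))  # ['u1', 'u2']
```

Every change to the list goes through a read of the existing JSON value followed by a full rewrite, so this fits best when updates are infrequent.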
Raedwald
sdolgy
  • Thanks. I'm leaning towards the JSON solution. I just realized that with Pig I can do much more advanced things, so I'm also going to take a look at Elephant Bird (https://github.com/kevinweil/elephant-bird) – Ben McCann Mar 28 '12 at 15:31
  • It seems to me this approach fits cases where later changes to the list are not an issue. It would require reading the column in order to recreate the JSON string with the newer data and save it again. – Ophir Radnitz Aug 06 '12 at 11:45