6

I recently ran into a case where Cassandra fits perfectly: storing time-based events with a custom TTL per event type. (The alternatives would be to store the events in Hadoop and do the bookkeeping, TTLs and so on, manually, which IMHO is overly complex, or to switch to HBase.) The question is how well Cassandra's MapReduce support works out of the box, without the DataStax Enterprise edition.
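
For illustration, here is a minimal sketch of that write path, assuming the Hector client; the column family name, event types, and TTL values are made up:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class EventWriter {
        private static final StringSerializer SS = StringSerializer.get();

        // Hypothetical TTLs per event type, in seconds.
        static int ttlFor(String eventType) {
            return "click".equals(eventType) ? 86400 : 604800; // 1 day vs. 7 days
        }

        static void writeEvent(Keyspace ks, String rowKey, String eventType,
                               long timestamp, String payload) {
            HColumn<String, String> col =
                    HFactory.createColumn(String.valueOf(timestamp), payload, SS, SS);
            // Cassandra expires the column by itself once the TTL elapses --
            // no manual bookkeeping as in the Hadoop variant.
            col.setTtl(ttlFor(eventType));
            Mutator<String> mutator = HFactory.createMutator(ks, SS);
            mutator.insert(rowKey, "events", col);
        }
    }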

It seems that they have invested a lot in CassandraFS, but I wonder whether the plain Pig CassandraLoader is actively maintained and actually scales, as it appears to do nothing more than iterate over the rows in slices. Does this work for hundreds of millions of rows?

Tobias

2 Answers

1

You can map/reduce using the random partitioner, but of course the keys you get are in random order. You probably want to use CL = ONE in Cassandra so you don't have to read from two nodes each time while doing map/reduce; that way it reads the local data. I have not used Pig, though.
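
For reference, a minimal sketch of the job wiring this answer refers to, assuming Cassandra 1.x's ColumnFamilyInputFormat; keyspace, column family, and addresses are placeholders:

    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class EventCountJob {
        // Each map() call receives one row: its key plus the sliced columns.
        static class EventMapper
                extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
            // map() would emit e.g. one count per event type; omitted here.
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "event-count");
            job.setJarByClass(EventCountJob.class);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            job.setMapperClass(EventMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            job.setOutputFormatClass(NullOutputFormat.class); // sketch: discard output

            Configuration conf = job.getConfiguration();
            ConfigHelper.setInputInitialAddress(conf, "127.0.0.1"); // any live node
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "events");

            // One input split per token range, so each task reads from a
            // replica that holds the data locally.
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0),
                            false, Integer.MAX_VALUE));
            ConfigHelper.setInputSlicePredicate(conf, predicate);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }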

Dean Hiller
  • The Pig support for Cassandra uses the ColumnFamilyInputFormat and -OutputFormat. So whatever you can or can't do in Hadoop maps fairly well to what you can and can't do with Cassandra and Pig. – Chris Gerken Nov 02 '12 at 03:26
  • And is it actually fast using the random partitioner? I guess it just does something like this: http://stackoverflow.com/questions/8418448/cassandra-hector-how-to-retrieve-all-rows-of-a-column-family (a paging loop like the sketch after these comments). I tried to iterate over a 100-million-row CF manually once and it never actually got started after it sent the first range slice query. – Tobias Nov 02 '12 at 06:53
  • That link doesn't look like map/reduce, as map/reduce implements a Mapper and a Reducer. I need to set it up again soon and don't have the code from my previous project. Yes, it is fast, since all the mappers run in parallel; the start is slow, just like Hadoop, as it delivers code to each task tracker. – Dean Hiller Nov 02 '12 at 18:56
  • "Hadoop" and "fast" don't really go together. That's the nature of sequential scans. But C* scans are faster than HBase, if that makes you feel better: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf – jbellis Nov 04 '12 at 05:27
  • I understand the nature of Hadoop and batches. I just tried to iterate over all rows (100,000,000 rows) of a Cassandra CF (random partitioner), which took ages, so I aborted. I was just asking myself whether MapReduce through Hadoop uses the same mechanisms. – Tobias Nov 08 '12 at 06:47
  • How many servers are you using for the 100,000,000 rows? The more servers, the faster; one server would take a while. – Dean Hiller Nov 08 '12 at 12:17
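
For comparison, this is roughly what the manual iteration from the linked question looks like: a single-client paging loop over get_range_slices, sketched here with Hector (the CF name and page sizes are made up):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.beans.Row;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;

    public class RowScanner {
        private static final StringSerializer SS = StringSerializer.get();

        // Pages through the whole CF; with the random partitioner the rows
        // arrive in token order, not key order.
        static void scan(Keyspace ks, String cf, int pageSize) {
            String startKey = "";
            while (true) {
                RangeSlicesQuery<String, String, String> query =
                        HFactory.createRangeSlicesQuery(ks, SS, SS, SS)
                                .setColumnFamily(cf)
                                .setKeys(startKey, "")           // from startKey to end of ring
                                .setRange(null, null, false, 100) // first 100 columns per row
                                .setRowCount(pageSize);
                OrderedRows<String, String, String> rows = query.execute().get();
                for (Row<String, String, String> row : rows) {
                    // process row.getKey() / row.getColumnSlice() here;
                    // note: the boundary row reappears as the first row of the next page
                }
                if (rows.getCount() < pageSize) {
                    break; // last page reached
                }
                startKey = rows.peekLast().getKey(); // re-sent as next page's start
            }
        }
    }

The crucial difference from map/reduce is that this loop is one client walking the ring serially, while ColumnFamilyInputFormat runs one such scan per token range in parallel on every node.
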
-2

Why not HBase? HBase is more suitable for time-series data. You can easily put billions of rows on a very small cluster and sustain up to 500k rows per second (up to 50 MB/s) on a small 3-node cluster with the WAL enabled. Cassandra has several flaws:

  1. In Cassandra you are effectively restricted in the number of row keys (imagine that with billions of rows your repair would run forever). So you will design a schema which 'shards' your time into buckets of, say, 1 hour, with the actual timestamps placed as columns (see the sketch after this list). But such a scheme doesn't scale well because of the high risk of extremely wide ('huge') rows.
  2. Another problem: you can't MapReduce over a range of data in Cassandra unless you use the ordered partitioner, which is not really an option at all due to its inability to balance well.
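
To make point 1 concrete, a sketch of such a bucketing scheme, assuming string keys and hour-sized buckets (the naming is made up):

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class TimeBuckets {
        // All events of one type within one hour share a row; the event
        // timestamp becomes the column name inside that row.
        static String rowKey(String eventType, long millis) {
            SimpleDateFormat hour = new SimpleDateFormat("yyyyMMddHH");
            hour.setTimeZone(TimeZone.getTimeZone("UTC"));
            return eventType + ":" + hour.format(new Date(millis)); // e.g. "click:2012110213"
        }
    }

A hot event type can still overflow a single one-hour row, which is exactly the 'huge row' risk mentioned above; the usual mitigation is to append a shard suffix to the key (e.g. a hash of the event id modulo N).
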
octo
  • It's because I am already using Cassandra in the project and don't really want to introduce a new technology... – Tobias Nov 01 '12 at 14:23
  • Good point. If it is okay to process all the data all the time, this should work; but if the data grows, I recommend reconsidering in favor of a store better adapted to MapReduce workloads. – octo Nov 01 '12 at 15:13
  • What nonsense is this? Many (most?) Cassandra clusters support billions of rows quite well. You mention repair but that is of course distributed as well. – jbellis Nov 04 '12 at 05:29
  • It is true that Cassandra discourages relying on global ordering for your data model but this is not much of a downside, particularly with Cassandra's built-in support for column indexes (which are supported in map/reduce as well). – jbellis Nov 04 '12 at 05:31