14

We have a lot of user interaction data from various websites stored in Cassandra, such as cookies, page visits, ads viewed, ads clicked, etc., that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying. However, we would also like to build large queries that would typically involve joins on large column families (containing millions of rows).

What approach is best suited for this? One possibility is to extract the data to a relational database such as MySQL and do the data mining there. An alternative could be to use Hadoop with Hive or Pig to run MapReduce queries for this purpose. I must admit I have zero experience with the latter.

Does anyone have experience with the performance differences of one vs. the other? Would you run MapReduce queries on a live Cassandra production instance, or on a backup copy to prevent the query load from affecting write performance?

NG Algo

2 Answers

13

In my experience, Cassandra is better suited to processes where you need real-time access to your data, fast random reads, and generally high traffic loads. However, if you start doing complex analytics against it, the availability of your Cassandra cluster will probably suffer noticeably, so it's in your best interest to keep that workload off the cluster.

Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.

If you can afford it, a real data warehouse would allow you to do complex queries with complex joins on multiple tables. These data-warehouse systems are widely used for reporting; here is a list of what are, in my opinion, the key players:

  • Netezza
  • Aster/TeraData
  • Vertica

A recent one that is gaining a lot of momentum is Amazon Redshift. It is currently in beta, but if you can get your hands on it, give it a try: it looks like a solid analytics platform with pricing much more attractive than the above solutions.

Alternatives like Hadoop MapReduce/Hive/Pig are also interesting to look at, though probably not a full replacement for a data warehouse. I would recommend Hive if you have a SQL background, because it will be very easy to understand what you're doing and it scales easily. There are also libraries integrated with Hadoop, like Apache Mahout, which allow you to do data mining on a Hadoop cluster; you should definitely give it a try and see if it fits your needs.

To give you an idea, an approach I've used that has been working well so far is to pre-aggregate the results in Hive and then generate the reports themselves in a data warehouse like Netezza, which computes the complex joins.
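To make the pre-aggregation step concrete, here is a minimal Python sketch of what such a Hive job computes (the event records and field names are made up for illustration; in Hive this would be a `GROUP BY` over the raw interaction table):

```python
from collections import Counter

# Hypothetical raw interaction events, as they might be exported from Cassandra.
events = [
    {"ad_id": "ad-1", "action": "view"},
    {"ad_id": "ad-1", "action": "click"},
    {"ad_id": "ad-2", "action": "view"},
    {"ad_id": "ad-1", "action": "view"},
]

# Pre-aggregate per (ad_id, action) -- the equivalent of
# "SELECT ad_id, action, COUNT(*) ... GROUP BY ad_id, action" in Hive.
counts = Counter((e["ad_id"], e["action"]) for e in events)

# These compact aggregate rows are what gets loaded into the warehouse,
# where joining them against other dimensions is cheap.
rows = [
    {"ad_id": ad, "action": action, "cnt": n}
    for (ad, action), n in sorted(counts.items())
]
```

The point is that the warehouse never sees the millions of raw rows, only the much smaller aggregates.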

Charles Menguy
  • Thank you very much for your insightful comments. I in fact came across another article [here](https://www.quantivo.com/blog/top-5-reasons-not-use-hadoop-analytics). The thoughts there echo yours regarding the use of Hive/Hadoop. I now believe using Hadoop/Hive/Pig to pre-aggregate data and then dumping it into a relational data warehouse for analytics is the right way to go. – NG Algo Jan 26 '13 at 01:42
8

Disclosure: I'm an engineer at DataStax.

In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.

As Charles mentioned, you don't want to run your analytics directly against the Cassandra nodes that are handling your real-time application needs, because analytics jobs can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkTopologyStrategy sense of the term). Queries performed as part of a Hadoop job will only impact those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs.
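As an illustration, a keyspace replicated across a real-time datacenter and an analytics datacenter might be declared like this (the datacenter names and replica counts here are hypothetical; check your cluster's actual DC names):

```sql
-- Sketch only: 'Cassandra' and 'Analytics' are assumed datacenter names.
CREATE KEYSPACE tracking
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'Cassandra': 3,   -- nodes serving real-time traffic
    'Analytics': 2    -- nodes dedicated to Hadoop jobs
  };
```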

There are a couple of upsides to the DSE approach. First, you don't need to perform any ETL before processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (CassandraFS), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.

Tyler Hobbs
  • Hi Tyler, thank you very much! You have brought up an extremely interesting perspective! Can you point me to some documentation detailing DSE's ability to devote a portion of a Cassandra cluster to analytics? – NG Algo Jan 28 '13 at 06:54
  • I'm having trouble finding a single page that summarizes it, but read about [NetworkTopologyStrategy](http://www.datastax.com/docs/1.1/cluster_architecture/replication#networktopologystrategy) (if you're not familiar with it). The CFS data only has replicas in the analytics "DC", and the Cassandra data has replicas in both. Normal Cassandra nodes can write at LOCAL_QUORUM, and analytics nodes can read at LOCAL_QUORUM, so neither DC will greatly impact the performance of the other. This may also be useful: http://www.datastax.com/docs/1.1/cluster_architecture/about_client_requests – Tyler Hobbs Jan 29 '13 at 04:35
  • Thanks Tyler, I was able to read up on the subject and this solves a lot of our problems. We will definitely try and proceed with this approach. – NG Algo Jan 29 '13 at 08:05