1

I have read that Columnar databases are apt for Aggregate Queries and Cassandra is a columnar database. I am trying to use count( values 'between' or '>=' for a specific partition) in Cassandra. Is this performance intensive?

Erick Ramirez
  • 13,964
  • 1
  • 18
  • 23
Sahil Ummer
  • 455
  • 1
  • 2
  • 9

3 Answers3

1

Cassandra is a partitioned row store. Data is stored in partitions, clustered together and served as "rows." It is not a columnar database.

An aggregate query to run a count will not perform well on Cassandra. To attempt it will be performance intensive, right up until the coordinator node times-out the query.

If this is a use case you need to solve for, another database will be the better option.

Aaron
  • 55,518
  • 11
  • 116
  • 132
1

Adding to @aaron's response, if you're performing an aggregate operation just within your partition, that might be okay. For example,

Let's assume your table schema is as follows:

CREATE TABLE IF NOT EXISTS keyspace_name.table_name (
 partition_key1 some_type,
 partition_key2 some_type,
 clustering_key1 some_type,
 clustering_key2 some_other_type,
 regular_column1 some_type,
 ...
 regular_columnN some_type,
 PRIMARY KEY ((partition_key1, partition_key2), clustering_key1, clustering_key2)
) WITH CLUSTERING ORDER BY(ck1 DESC, ck2 DESC)
AND ...;

it may be okay to do aggregation queries such as the following to be performant,

SELECT COUNT(some_regular_column) FROM keyspace_name.table_name WHERE partition_key1 = ? AND partition_key2 = ? AND clustering_key1 >= ? AND clustering_key2 <= ?;
Madhavan
  • 758
  • 4
  • 8
1

It's a common misconception that Cassandra is a columnar database. I think it comes from the old terminology "column family" for tables. Data is stored in rows containing columns of key-value pairs which is why the tables used to be called column families.

A major difference compared to traditional relational databases is that Cassandra tables can be 2-dimensional (each record contains exactly one row) or multi-dimensional (each record can contain ONE OR MORE rows).

On the other hand, columnar databases flips a 2-dimensional table such that data is stored in columns instead of rows, specifically optimised for analytics-type queries such as aggregations -- this is NOT Cassandra.

Going back to your question, counting the rows within a single partition is ok to do for most data models. The key is to restrict the query to just one partition like:

    SELECT COUNT(some_column) FROM table_name
        WHERE pk = ?

It's also OK to count the rows in a range query as long as they're restricted to one partition like:

    SELECT COUNT(some_column) FROM table_name
        WHERE pk = ?
        AND clustering_col >= ?
        AND clustering_col <= ?

If you don't restrict the query to a single partition, it might work for (a) very small datasets and (b) clusters with a very low number of nodes but it doesn't scale as (c) the dataset grows, and (d) the number of nodes increases. I've explained why performing aggregates such as COUNT() is bad in Cassandra in this post -- https://community.datastax.com/questions/6897/.

This is not to say that Cassandra isn't a good fit. Cassandra is a good choice if your primary use case is for storing real-time data for OLTP workloads. For analytics queries, you just need to use other software like Apache Spark since the spark-cassandra-connector will optimise the queries to Cassandra. Cheers!

Erick Ramirez
  • 13,964
  • 1
  • 18
  • 23