1

I know Cassandra count() is an expensive operation as it needs a complete table scan. https://www.datastax.com/blog/running-count-expensive-cassandra

But let's say we have a table hotel with hotel_type as the partition key and we run the query

select count(*) from hotel where hotel_type= 'luxury';

Will this be expensive too? Actually, I need to run 1 million queries like this to get the count for different hotel_types. Could finding the counts impact the prod Cassandra cluster?

Update:

I saw we can use dsbulk for the counting. How is dsbulk count different from CQL count()?


2 Answers

2

The main concept of Cassandra data modelling is the primary key, which determines where the data physically lives on the cluster.

When the primary key is used with the count() function, the coordinator node (which handles the request) will retrieve all of the associated data and then apply the aggregation locally.

Since this request filters by the primary key and does not need to scan all of the table's data across the cluster, you can use it.

However, you need to keep in mind that this will induce more resource consumption on the coordinator nodes, so a benchmark will be helpful before going to production.
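
To make the contrast concrete, here is a minimal CQL sketch using the hotel table from the question (the unbounded query is shown only as the pattern to avoid):

-- Unbounded: every partition on every node has to be scanned (expensive).
SELECT COUNT(*) FROM hotel;

-- Restricted to one partition: only the replicas owning the 'luxury' partition are involved.
SELECT COUNT(*) FROM hotel WHERE hotel_type = 'luxury';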

Hakan
  • Thanks, @Hakan for your reply. I am thinking about sending 1 million count requests. Will I need to take more precautions for that? – Purushottam Baghel Mar 25 '23 at 15:08
  • I want to say 'it depends' (configuration of the cluster/nodes, is it a shared cluster?, is there a preferable time slot when client requests decrease?, an estimate of the average/max number of rows per partition key) => a benchmark will be very helpful/recommended. Also, do you plan to execute these queries regularly (each day/month/…)? – Hakan Mar 26 '23 at 08:47
  • As of now, I just want to execute the queries one time. – Purushottam Baghel Mar 27 '23 at 03:45
2

You are correct that an unbounded COUNT() is an expensive operation in Cassandra for the reasons I've explained in this post -- https://dba.stackexchange.com/questions/314567/.

But counting the rows of a single partition (i.e. the SELECT query is restricted with WHERE pk = ?) is a valid use case for the CQL COUNT() function.

Additionally, it should be fine executing a million read operations asynchronously provided you throttle the requests to a throughput your cluster can handle.
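
As an illustration of request throttling (not part of the original answer), here is a minimal sketch with the DataStax Python driver; the contact point, keyspace name and concurrency limit are assumptions you would need to adapt:

# Minimal sketch: cap the number of in-flight async count queries with a semaphore.
from threading import Semaphore
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])                  # placeholder contact point
session = cluster.connect("bookings")            # placeholder keyspace
count_stmt = session.prepare("SELECT COUNT(*) FROM hotel WHERE hotel_type = ?")

hotel_types = ["luxury", "budget", "boutique"]   # in reality, the ~1M values to count
in_flight = Semaphore(32)                        # tune to what the cluster can absorb
counts = {}

def on_success(rows, hotel_type):
    counts[hotel_type] = rows[0].count           # COUNT(*) comes back in a column named "count"
    in_flight.release()

def on_error(exc, hotel_type):
    counts[hotel_type] = exc
    in_flight.release()

for hotel_type in hotel_types:
    in_flight.acquire()                          # blocks once 32 requests are outstanding
    future = session.execute_async(count_stmt, [hotel_type])
    future.add_callbacks(on_success, on_error,
                         callback_kwargs={"hotel_type": hotel_type},
                         errback_kwargs={"hotel_type": hotel_type})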

The count command in the DataStax Bulk Loader (DSBulk) tool is one of the recommended ways of running a full table count because DSBulk optimises the operation by breaking it up into small token range segments and then fires off multiple distributed requests instead of a single full table scan.
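
For example, a full-table count with DSBulk looks like this (keyspace, table and contact point are placeholders; the --stats options shown for per-partition output should be verified against the DSBulk version in use):

# global row count for the whole table
dsbulk count -k bookings -t hotel -h 10.0.0.1

# row counts per partition (i.e. per hotel_type), largest partitions first
dsbulk count -k bookings -t hotel -h 10.0.0.1 --stats.modes partitions --stats.numPartitions 100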

If you're interested, here are some references with examples for using DSBulk:

Cheers!

Erick Ramirez
  • But @Erick, I am getting a count query timeout for a query on a single partition. This hotel_type has 1 million rows. ``` select count(*) from hotel where hotel_type= 'luxury'; ``` Table: ``` CREATE TABLE hotel( hotel_type text, hotel_id text, PRIMARY KEY (hotel_type, hotel_id) ) WITH CLUSTERING ORDER BY (hotel_id DESC) ``` – Purushottam Baghel Apr 01 '23 at 04:36
  • Then the timeout is expected because it would take too long to read all those rows from the disk. You need to limit the size of the query by specifying clustering columns in the filter. Cheers! – Erick Ramirez Apr 05 '23 at 06:01
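
For illustration, "specifying clustering columns in the filter" could look like the following, where the count is split into slices of the luxury partition and the partial counts are summed on the client side (the hotel_id bounds are hypothetical):

-- slice 1 of the 'luxury' partition
SELECT COUNT(*) FROM hotel WHERE hotel_type = 'luxury' AND hotel_id >= 'a' AND hotel_id < 'm';
-- slice 2 of the 'luxury' partition
SELECT COUNT(*) FROM hotel WHERE hotel_type = 'luxury' AND hotel_id >= 'm';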