22

Maybe it is a stupid question, but I'm not able to determine the number of rows in a table in Cassandra.

This is what I tried:

select count(*) from articles;

It works fine if the table is small, but once it fills up, I always run into timeout issues:

cqlsh:

  • OperationTimedOut: errors={}, last_host=127.0.0.1

DBeaver:

  • Run 1: 225,000 (7477 ms)
  • Run 2: 233,637 (8265 ms)
  • Run 3: 216,595 (7269 ms)

I assume that it hits some timeout and just aborts. The actual number of entries in the table is probably much higher.

I'm testing against a local Cassandra instance which is completely idle. I would not mind if it has to do a full table scan and is unresponsive during that time.

Is there a way to reliably count the number of entries in a Cassandra table?

I'm using Cassandra 2.1.13.

Philipp Claßen

7 Answers

22

Here is my current workaround:

COPY articles TO '/dev/null';
...
3568068 rows exported to 1 files in 2 minutes and 16.606 seconds.

Background: Cassandra supports exporting a table to a text file, for instance:

COPY articles TO '/tmp/data.csv';
Output: 3568068 rows exported to 1 files in 2 minutes and 25.559 seconds

That also matches the number of lines in the generated file:

$ wc -l /tmp/data.csv
3568068
Philipp Claßen
  • The COPY command essentially does a full cluster scan, just like COUNT(*) would, so this solution will not scale either. Like others have mentioned, you'll have to settle for an approximate row count to maintain speed and stability in your cluster. The options are counter columns, HyperLogLog, or reading your table's metadata, which gives you an estimate of how many partitions there are (see the nodetool sketch after these comments). – fromanator Apr 20 '16 at 13:54
  • I'm OK with sacrificing scalability, as my intention was only to understand what is going on in my local development environment. I agree that it is not a solution that can be used in production. – Philipp Claßen Apr 20 '16 at 15:42
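To illustrate the metadata option from the first comment above: nodetool can report an estimated partition count without scanning the table. A minimal sketch, assuming the keyspace is called mykeyspace; the output label ("Number of keys (estimate)") can vary between Cassandra versions, and the figure is per node:

$ nodetool cfstats mykeyspace.articles | grep -i "keys"

Keep in mind this estimates partitions, not rows, so it only approximates the row count when each partition holds a single row.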
19

As far as I can see, your problem is connected to the cqlsh timeout: OperationTimedOut: errors={}, last_host=127.0.0.1

You can simply increase it with these options:

 --connect-timeout=CONNECT_TIMEOUT
                       Specify the connection timeout in seconds (default: 5
                       seconds).
 --request-timeout=REQUEST_TIMEOUT
                       Specify the default request timeout in seconds
                       (default: 10 seconds).
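
For example, to give a one-off count enough time to finish (mykeyspace is a placeholder, and 3600 seconds is just a generously large value):

$ cqlsh --request-timeout=3600 -e "SELECT count(*) FROM mykeyspace.articles;"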
Oleksandr Petrenko
  • Thanks! Yes, with an increased timeout, SELECT returns the expected number of elements. – Philipp Claßen Apr 21 '16 at 11:25
  • Increasing both timeouts didn't work for me :/ I set both to 100 seconds, but `cqlsh` failed after 16 seconds with `ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses]`. Using `COPY TO` per [@PhilippClaßen's answer](https://stackoverflow.com/a/36745042/1727828) was the only way I could compute the row count. – mxxk Sep 12 '17 at 07:50
  • Increasing both timeouts did not work for me either. I also used COPY tablename TO '/dev/null'; to get the count. Maybe there is a specific format for providing --connect-timeout and --request-timeout? – donlys Apr 25 '18 at 22:22
13

Is there a way to reliably count the number of entries in a Cassandra table?

The plain answer is no. This is not a Cassandra limitation but a hard challenge for distributed systems: counting unique items reliably.

That's the challenge that approximation algorithms like HyperLogLog address.

One possible solution is to use a counter in Cassandra to count the number of rows, but even counters can miscount in some corner cases, so you'll get a few percent error.
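
For reference, the counter approach could look like this in CQL; a minimal sketch, with the row_counts table and its columns invented for illustration:

CREATE TABLE row_counts (
    table_name text PRIMARY KEY,
    cnt counter
);

-- run by the application after every successful insert into articles
UPDATE row_counts SET cnt = cnt + 1 WHERE table_name = 'articles';

-- read back the (approximate) total
SELECT cnt FROM row_counts WHERE table_name = 'articles';

The corner cases mentioned above mostly come from retries: if a counter update times out, the client cannot tell whether it was applied, and retrying may double-count.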

doanduyhai
4

This is a good utility for counting rows that avoids the timeout issues that happen when running a large COUNT(*) in Cassandra:

https://github.com/brianmhess/cassandra-count

Kat
1

The reason is simple:

When you're using:

SELECT count(*) FROM articles;

it has the same effect on the database as:

SELECT * FROM articles;

You have to query all of your nodes, and Cassandra simply runs into a timeout.

You can change the timeout, but it isn't a good solution. (For a one-off query it's fine, but don't rely on it for your regular queries.)

There's a better solution: make your client count your rows. You can create a Java app that counts your rows as you insert them and stores the result in a counter column in a Cassandra table, as sketched below.
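
A sketch of the two statements such a client would issue per row, reusing the hypothetical row_counts counter table from the answer above (the articles columns here are assumptions as well); note that counter updates cannot be batched together with regular writes, so the application issues them separately:

-- the regular write
INSERT INTO articles (id, title) VALUES (uuid(), 'example title');
-- the bookkeeping write, against the counter table
UPDATE row_counts SET cnt = cnt + 1 WHERE table_name = 'articles';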

Citrullin
0

You can use COPY to avoid the Cassandra timeout that usually happens on count(*).

Use this bash one-liner:

cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'

Copying only the first partition key column keeps the export cheap; sed -n 5p picks the fifth line of the cqlsh output (where the export summary appears), and the final sed strips everything after the leading row count.

Shubham
0

You can set the request timeout in seconds as a command-line option when using the DataStax cqlsh. The default value is 10 seconds.

$ cqlsh --request-timeout=3600
Usman Khawaja