48

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.

For instance, if I have a column family containing users and want to get the number of users, how could I do it? Each user is its own row.

Henri Liljeroos
  • 511
  • 1
  • 4
  • 4

6 Answers

40

If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:

nodetool --host <hostname> cfstats

This will dump out a list for each column family looking like this:

Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Postives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634

The "Number of Keys (estimate)" line is a good guess across the cluster, and retrieving it is a lot faster than explicit count approaches.
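If you want the figure programmatically, the estimate can be scraped out of the cfstats text. A minimal sketch (plain text parsing, not an official API; the field name varies between Cassandra versions, as the comments below note):

```python
import re

def key_estimate(cfstats_text, cf_name):
    """Pull the per-node 'Number of Keys (estimate)' figure for one
    column family out of a `nodetool cfstats` dump."""
    in_cf = False
    for line in cfstats_text.splitlines():
        line = line.strip()
        if line.startswith("Column Family:"):
            in_cf = (line.split(":", 1)[1].strip() == cf_name)
        elif in_cf:
            m = re.match(r"Number of Keys \(estimate\):\s*(\d+)", line)
            if m:
                return int(m.group(1))
    return None
```

Since each node only estimates the keys in its own SSTables, you would run this against every node's output and sum the results for a rough cluster-wide count.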

Justin DeMaris
  • 856
  • 6
  • 5
  • 3
    Isn't the Number of Keys an estimate for that particular node, not the entire cluster? – Christopher Smith Jan 22 '13 at 01:59
  • Yes it is. The nodetool contacts the ColumnFamilyStoreMBean which uses ColumnFamilyStore, which uses DataTracker, which returns the sum of the estimates created by each local SSTable. – Martin Serrano Dec 22 '13 at 16:42
  • 3
    'nodetool --host cfstats' gives me https://gist.github.com/cdwijayarathna/e6b4d3d7d8c272fcfd24, but there is no attribute 'Number of Keys', what is the reason for that? – Chamila Wijayarathna Dec 12 '14 at 06:17
  • it seems they dropped this info in 2.0 https://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFstats.html – nils petersohn Mar 29 '15 at 14:32
6

I found an excellent article on this here: http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra

select count(*) from cf limit 1000000

The above statement can be used if an approximate upper bound is known beforehand. I found this useful in my case.
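If the table is too big for a single count to finish before rpc_timeout, one general CQL pattern (not from the linked article; `cf` and `key` here are illustrative names) is to split the scan across token ranges and sum the partial counts client-side:

```sql
-- Each query scans only one slice of the ring; sum the results yourself.
-- The token bounds below are illustrative, not real ring boundaries.
SELECT COUNT(*) FROM cf WHERE token(key) >= -9223372036854775808 AND token(key) < 0;
SELECT COUNT(*) FROM cf WHERE token(key) >= 0;
```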

ajjain
  • 1,151
  • 2
  • 15
  • 28
  • 11
Bad solution: on big tables you will get rpc_timeout, because it will scan the whole table – Vlad Jul 02 '14 at 00:58
6

If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.

If you are not, you will need to store your user ids in a special row.
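To give a feel for the range-scan approach, here is a client-side sketch of counting by paging, with a plain Python function standing in for a get_range_slice-style Thrift call (the in-memory key list below simulates the server side under an order-preserving partitioner):

```python
import bisect

def count_rows(fetch_page, page_size=1000):
    """Count rows by paging: ask for `page_size` keys starting at
    `start`, then resume from the last key returned.  `fetch_page`
    stands in for a get_range_slice-style call and must return keys
    in partitioner (i.e. sorted) order, start-inclusive."""
    total, start = 0, ""
    while True:
        keys = fetch_page(start, page_size)
        if start and keys:
            keys = keys[1:]  # first key repeats the previous page's last key
        if not keys:
            return total
        total += len(keys)
        start = keys[-1]

# Simulated server side: a sorted in-memory key list.
_keys = ["user%05d" % i for i in range(2500)]

def fetch_page(start, n):
    i = bisect.bisect_left(_keys, start)
    return _keys[i:i + n]
```

Note this touches every row, which is exactly why it gets expensive on large column families.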

jbellis
  • 19,347
  • 2
  • 38
  • 47
  • 3
    What if I have millions of rows? get_range_slice feels kind of heavy for a simple count operation. Am I right? – Henri Liljeroos Dec 27 '09 at 16:50
  • 4
    You are right, counting raw objects is relatively expensive in distributed systems compared to what you are used to on a system that can just keep an index in local memory. – jbellis Dec 28 '09 at 01:22
2

[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]

I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.

I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.

Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.

http://www.mail-archive.com/user@cassandra.apache.org/msg03965.html
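The idea, sketched with a thread-safe in-process counter standing in for an external atomic-increment service such as memcached (class and method names are illustrative, not any library's API):

```python
import threading

class Counter:
    """In-process stand-in for an external atomic counter service;
    the real thing (memcached incr/decr, ZooKeeper, redis) would
    live outside Cassandra and be shared by all writers."""
    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def incr(self, key, delta=1):
        with self._lock:
            self._values[key] = self._values.get(key, 0) + delta
            return self._values[key]

    def decr(self, key, delta=1):
        return self.incr(key, -delta)
```

Every code path that inserts a user would call `incr("users")`, and every delete would call `decr("users")`; reading the count is then a single key lookup instead of a table scan.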

Ben Burns
  • 14,978
  • 4
  • 35
  • 56
0

There is always map/reduce, but that probably goes without saying. If you have that set up with Hive or Pig, you can count any table across the cluster. I am not sure the tasktrackers know about Cassandra locality, though, so the job may have to stream the whole table across the network: you get tasktrackers on Cassandra nodes, but the data they receive may come from another Cassandra node :(. I would love to hear if anyone knows for sure.

NOTE: We are setting up map/reduce on cassandra mainly because if we want an index later, we can map/reduce one into cassandra.
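The shape of such a counting job (generic map/reduce, not Hive or Pig syntax) is just a mapper that emits 1 per row and a reducer that sums:

```python
from functools import reduce

def map_phase(rows):
    # each mapper emits one (key, 1) pair per Cassandra row it reads
    return [("rows", 1) for _ in rows]

def reduce_phase(pairs):
    # the reducer sums the 1s to get the total row count
    return reduce(lambda acc, kv: acc + kv[1], pairs, 0)
```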

Dean Hiller
  • 19,235
  • 25
  • 129
  • 212
-3

I have been getting the counts by converting the data into a hash in PHP.

Philip Schlump
  • 3,070
  • 2
  • 20
  • 17
  • 3
    That clearly doesn't scale, as at some point the hash won't fit (usefully) into PHP's ram any more. Cassandra is for scalable stuff. – MarkR Dec 23 '09 at 17:45
  • I know - that is what concerns me. I am still in the development stage and looking for a better solution to problems like this. – Philip Schlump Dec 23 '09 at 19:30