48

Is there a way to get a row count (key count) of a single column family in Cassandra? get_count can only be used to get the column count.

For instance, if I have a column family containing users and want to get the number of users, how could I do it? Each user is its own row.

Henri Liljeroos
  • 511
  • 1
  • 4
  • 4

6 Answers

40

If you are working on a large data set and are okay with a pretty good approximation, I highly recommend using the command:

nodetool --host <hostname> cfstats

This will dump out a list for each column family looking like this:

Column Family: widgets
SSTable count: 11
Space used (live): 4295810363
Space used (total): 4295810363
Number of Keys (estimate): 9709824
Memtable Columns Count: 99008
Memtable Data Size: 150297312
Memtable Switch Count: 434
Read Count: 9716802
Read Latency: 0.036 ms.
Write Count: 9716806
Write Latency: 0.024 ms.
Pending Tasks: 0
Bloom Filter False Postives: 10428
Bloom Filter False Ratio: 1.00000
Bloom Filter Space Used: 18216448
Compacted row minimum size: 771
Compacted row maximum size: 263210
Compacted row mean size: 1634

The "Number of Keys (estimate)" line is a good guess across the cluster, and retrieving it is a lot faster than explicit count approaches.
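If you want the figure programmatically, the estimate can be scraped out of the cfstats text. A minimal sketch (plain text parsing, not an official API; the field name varies between Cassandra versions, as the comments below note):

```python
import re

def key_estimate(cfstats_text, cf_name):
    """Pull the per-node 'Number of Keys (estimate)' figure for one
    column family out of a `nodetool cfstats` dump."""
    in_cf = False
    for line in cfstats_text.splitlines():
        line = line.strip()
        if line.startswith("Column Family:"):
            in_cf = (line.split(":", 1)[1].strip() == cf_name)
        elif in_cf:
            m = re.match(r"Number of Keys \(estimate\):\s*(\d+)", line)
            if m:
                return int(m.group(1))
    return None
```

Since each node only estimates the keys in its own SSTables, you would run this against every node's output and sum the results for a rough cluster-wide count.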

Justin DeMaris
  • 856
  • 6
  • 5
  • 3
    Isn't the Number of Keys an estimate for that particular node, not the entire cluster? – Christopher Smith Jan 22 '13 at 01:59
  • Yes it is. The nodetool contacts the ColumnFamilyStoreMBean which uses ColumnFamilyStore, which uses DataTracker, which returns the sum of the estimates created by each local SSTable. – Martin Serrano Dec 22 '13 at 16:42
  • 3
    'nodetool --host cfstats' gives me https://gist.github.com/cdwijayarathna/e6b4d3d7d8c272fcfd24, but there is no attribute 'Number of Keys', what is the reason for that? – Chamila Wijayarathna Dec 12 '14 at 06:17
  • it seems they dropped this info in 2.0 https://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFstats.html – nils petersohn Mar 29 '15 at 14:32
6

I found an excellent article on this here: http://www.planetcassandra.org/blog/post/counting-keys-in-cassandra

select count(*) from cf limit 1000000

The above statement can be used if an approximate upper bound is known beforehand. I found this useful in my case.
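If the table is too big for a single count to finish before rpc_timeout, one general CQL pattern (not from the linked article; `cf` and `key` here are illustrative names) is to split the scan across token ranges and sum the partial counts client-side:

```sql
-- Each query scans only one slice of the ring; sum the results yourself.
-- The token bounds below are illustrative, not real ring boundaries.
SELECT COUNT(*) FROM cf WHERE token(key) >= -9223372036854775808 AND token(key) < 0;
SELECT COUNT(*) FROM cf WHERE token(key) >= 0;
```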

ajjain
  • 1,151
  • 2
  • 15
  • 28
  • 11
Bad solution: on big tables you will get rpc_timeout, because it will scan the whole table – Vlad Jul 02 '14 at 00:58
6

If you are using an order-preserving partitioner, you can do this with get_range_slice or get_key_range.

If you are not, you will need to store your user ids in a special row.
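To give a feel for the range-scan approach, here is a client-side sketch of counting by paging, with a plain Python function standing in for a get_range_slice-style Thrift call (the in-memory key list below simulates the server side under an order-preserving partitioner):

```python
import bisect

def count_rows(fetch_page, page_size=1000):
    """Count rows by paging: ask for `page_size` keys starting at
    `start`, then resume from the last key returned.  `fetch_page`
    stands in for a get_range_slice-style call and must return keys
    in partitioner (i.e. sorted) order, start-inclusive."""
    total, start = 0, ""
    while True:
        keys = fetch_page(start, page_size)
        if start and keys:
            keys = keys[1:]  # first key repeats the previous page's last key
        if not keys:
            return total
        total += len(keys)
        start = keys[-1]

# Simulated server side: a sorted in-memory key list.
_keys = ["user%05d" % i for i in range(2500)]

def fetch_page(start, n):
    i = bisect.bisect_left(_keys, start)
    return _keys[i:i + n]
```

Note this touches every row, which is exactly why it gets expensive on large column families.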

jbellis
  • 19,347
  • 2
  • 38
  • 47
  • 3
    What if I have millions of rows? get_range_slice feels kind of heavy for a simple count operation. Am I right? – Henri Liljeroos Dec 27 '09 at 16:50
  • 4
    You are right, counting raw objects is relatively expensive in distributed systems compared to what you are used to on a system that can just keep an index in local memory. – jbellis Dec 28 '09 at 01:22
2

[Edit: This answer is out of date as of Cassandra 0.8.1 -- please see the Counters entry in the Cassandra Wiki for the correct way to handle Counter Columns in Cassandra.]

I'm new to Cassandra, but I have messed around a lot with Google's App Engine. If no other solution presents itself, you may consider keeping a separate counter in a platform that supports atomic increment operations like memcached. I know that Cassandra is working on atomic counter increment/decrement functionality, but it's not yet ready for prime time.

I can only post one hyperlink because I'm new, so for progress on counter support see the link in my comment below.

Note that this thread suggests ZooKeeper, memcached, and redis as possible solutions. My personal preference would be memcached.

http://www.mail-archive.com/user@cassandra.apache.org/msg03965.html
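The idea, sketched with a thread-safe in-process counter standing in for an external atomic-increment service such as memcached (class and method names are illustrative, not any library's API):

```python
import threading

class Counter:
    """In-process stand-in for an external atomic counter service;
    the real thing (memcached incr/decr, ZooKeeper, redis) would
    live outside Cassandra and be shared by all writers."""
    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def incr(self, key, delta=1):
        with self._lock:
            self._values[key] = self._values.get(key, 0) + delta
            return self._values[key]

    def decr(self, key, delta=1):
        return self.incr(key, -delta)
```

Every code path that inserts a user would call `incr("users")`, and every delete would call `decr("users")`; reading the count is then a single key lookup instead of a table scan.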

Ben Burns
  • 14,978
  • 4
  • 35
  • 56
0

There is always map/reduce, but that probably goes without saying. If you have that set up with Hive or Pig, you can count any table across the cluster. I am not sure the tasktrackers know about Cassandra locality, though, so the job may have to stream the whole table across the network: you get tasktrackers on Cassandra nodes, but the data they receive may come from another Cassandra node :(. I would love to hear if anyone knows for sure.

NOTE: We are setting up map/reduce on cassandra mainly because if we want an index later, we can map/reduce one into cassandra.
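The shape of such a counting job (generic map/reduce, not Hive or Pig syntax) is just a mapper that emits 1 per row and a reducer that sums:

```python
from functools import reduce

def map_phase(rows):
    # each mapper emits one (key, 1) pair per Cassandra row it reads
    return [("rows", 1) for _ in rows]

def reduce_phase(pairs):
    # the reducer sums the 1s to get the total row count
    return reduce(lambda acc, kv: acc + kv[1], pairs, 0)
```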

Dean Hiller
  • 19,235
  • 25
  • 129
  • 212
-3

I have been getting the counts by converting the data into a hash in PHP.

Philip Schlump
  • 3,070
  • 2
  • 20
  • 17
  • 3
    That clearly doesn't scale, as at some point the hash won't fit (usefully) into PHP's ram any more. Cassandra is for scalable stuff. – MarkR Dec 23 '09 at 17:45
  • I know - that is what concerns me. I am still in the development stage and looking for a better solution to problems like this. – Philip Schlump Dec 23 '09 at 19:30