How to count the number of row keys for a particular column_family in Cassandra (read details)

Question

I am trying to load data from SQL into NoSQL, i.e. Cassandra, but a few rows are not matching. Can somebody tell me how to count the number of row keys for a particular column_family in Cassandra?

I tried get_count and get_multicount, but these methods require keys to be passed. In my case I do not know the keys; instead I need the count of the row keys. list column_family_name gives me the list, but it is limited to only 100 rows. Is there any way I can override the 100 limit?

As far as I know, there is no way to get a row count for a column family. You have to perform a range query over the whole column family instead. – jterrace, Nov 21 '11 at 22:17
@jterrace Thanks, can you please elaborate on performing the range query (with an example, preferably)? – Nish, Nov 21 '11 at 22:26
All I want is the Cassandra equivalent of the SQL query "select count(row_key) from table_name". – Nish, Nov 21 '11 at 22:26

score 1 · Accepted Answer · answered Nov 21 '11 at 22:55

As far as I know, there is no way to get a row count for a column family. You have to perform a range query over the whole column family instead.

If cf is your column family, something like this should work:

num_rows = len(list(cf.get_range()))

However, the documentation for get_range indicates that this might cause issues if you have too many rows. You might have to do it in chunks, using start and row_count.
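
A minimal sketch of counting in chunks with start and row_count follows. It assumes a pycassa ColumnFamily whose get_range() accepts start, row_count, and column_count and treats the start key as inclusive; the function name count_rows_chunked and the chunk size are illustrative, not part of pycassa.

```python
def count_rows_chunked(cf, chunk_size=1000):
    """Count rows by walking the column family one chunk of keys at a time."""
    if chunk_size < 2:
        # Each chunk after the first repeats one key, so size 1 would never advance.
        raise ValueError("chunk_size must be at least 2")
    total = 0
    start = ''  # an empty start key means "from the beginning of the range"
    while True:
        # Fetch a single column per row; only the keys are needed for counting.
        keys = [key for key, _ in cf.get_range(start=start,
                                               row_count=chunk_size,
                                               column_count=1)]
        fetched = len(keys)
        # The start key is assumed inclusive, so drop the row already counted.
        if start != '' and keys and keys[0] == start:
            keys = keys[1:]
        total += len(keys)
        if fetched < chunk_size:
            break  # a short chunk means the end of the range was reached
        start = keys[-1]
    return total
```

With RandomPartitioner the rows come back in token order rather than key order, which should not matter here because each chunk simply resumes from the last key seen.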

jterrace

Thanks, but it takes a very long time to execute. I'm still figuring out how Cassandra (NoSQL) is efficient at retrieving data. I do agree it's faster for inserting millions of records at once, but not for retrieval! :( – Nish Nov 21 '11 at 23:55
Are you using RandomPartitioner? – jterrace Nov 22 '11 at 00:42
You have to use a partitioner, even with one node. – jterrace Nov 22 '11 at 02:16
If you set column_count=0 and filter_empty=False in get_range(), this will give you just the keys back. Additionally, get_range returns a generator, so you can do something like "for key, _ in get_range(): count += 1" so that you're not pulling the entire result into a list at once. There's no need to use 'start' and 'row_count' if you do this; pycassa will chunk the requests automatically. – Tyler Hobbs Nov 22 '11 at 05:29
@TylerHobbs Thanks. The count was helpful, but I found a few keys missing. To get just the keys back, I set column_count=0 but could not find a way to set filter_empty to False. I get this: `TypeError: get_range() got an unexpected keyword argument 'filter_empty'` Any clues? – Nish Nov 22 '11 at 23:00
@Nish filter_empty was only added in pycassa 1.3. You might have an older version? – jterrace Nov 22 '11 at 23:41
In that case you should either upgrade to 1.3 as jterrace suggests, or set column_count to 1 instead of 0. – Tyler Hobbs Nov 24 '11 at 03:48
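
Putting the comments above together, a sketch of a key-only count could look like this. It assumes a pycassa ColumnFamily object; count_row_keys is an illustrative name, and the fallback to column_count=1 follows the suggestion above for pycassa versions before 1.3.

```python
def count_row_keys(cf):
    """Count rows by streaming keys from cf.get_range() without building a list."""
    try:
        # pycassa >= 1.3: request no columns and keep empty rows, so only keys return.
        rows = cf.get_range(column_count=0, filter_empty=False)
    except TypeError:
        # Older pycassa has no filter_empty argument; fetch one column per row instead.
        rows = cf.get_range(column_count=1)
    # get_range returns a generator, so rows are counted as they arrive.
    return sum(1 for _key, _columns in rows)

# Hypothetical usage; keyspace, server, and column family names are placeholders:
# import pycassa
# pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
# cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
# print(count_row_keys(cf))
```

As far as I understand, with filter_empty=False recently deleted rows can still show up as empty rows, so the two branches may not give the same count on a column family that has had deletes.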

score 0 · Answer 2 · answered Mar 17 '16 at 16:06

You can count Cassandra rows without reading all rows.

See the implementation of cassandraCount() in the Spark Cassandra connector, which does this quite efficiently.

Joseph Lust
