
I am trying to load data from SQL into a NoSQL store, i.e. Cassandra, but a few rows do not match. Can somebody tell me how to count the number of row keys for a particular column family in Cassandra?

I tried get_count and multiget_count, but these methods require keys to be passed. In my case I do not know the keys; instead, I need a count of the row keys themselves. list column_family_name gives me the list, but it is limited to only 100 rows. Is there any way I can override the 100-row limit?

animuson
Nish
  • As far as I know, there is no way to get a row count for a column family. You have to perform a range query over the whole column family instead. – jterrace Nov 21 '11 at 22:17
  • @jterrace Thanks, can you please elaborate about performing the range query? (with a example preferably) – Nish Nov 21 '11 at 22:26
  • All I want is the Cassandra equivalent of the SQL query "select count(row_key) from table_name". – Nish Nov 21 '11 at 22:26

2 Answers


As far as I know, there is no way to get a row count for a column family. You have to perform a range query over the whole column family instead.

If cf is your column family, something like this should work:

num_rows = len(list(cf.get_range()))

However, the documentation for get_range indicates that this might cause issues if you have too many rows. You might have to do it in chunks, using start and row_count.
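To avoid building the whole list in memory, the count can be taken by iterating over the generator that get_range returns. The following is only a minimal sketch: it assumes pycassa 1.3 or later (where get_range accepts filter_empty), and the helper name count_rows, the keyspace name, and the column family name are hypothetical placeholders.

```python
def count_rows(cf, buffer_size=1024):
    """Count row keys in a column family by streaming a key-only range scan.

    `cf` is expected to behave like a pycassa ColumnFamily. The name
    `count_rows` is illustrative and not part of pycassa.
    """
    count = 0
    # column_count=0 requests no columns, so only keys come back.
    # filter_empty=False keeps rows that would otherwise be dropped
    # because they come back with no columns.
    rows = cf.get_range(column_count=0,
                        filter_empty=False,
                        buffer_size=buffer_size)
    for _key, _columns in rows:
        count += 1
    return count


# Hypothetical usage (requires pycassa and a running cluster):
#   import pycassa
#   pool = pycassa.ConnectionPool('MyKeyspace')
#   cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
#   print(count_rows(cf))
```

This still scans every row, so it stays slow on large column families; it only keeps memory use flat.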

jterrace
  • Thanks, but it takes a very long time to execute. I'm still figuring out how Cassandra (NoSQL) is efficient at retrieving data. I agree it's faster at inserting millions of records at once, but not at retrieval! :( – Nish Nov 21 '11 at 23:55
  • Are you using RandomPartitioner? – jterrace Nov 22 '11 at 00:42
  • you have to use a partitioner, even with one node – jterrace Nov 22 '11 at 02:16
  • If you set column_count=0 and filter_empty=False in get_range(), this will give you just the keys back. Additionally, get_range returns a generator, so you can do something like "for key, _ in get_range(): count += 1" so that you're not pulling the entire result into a list at once. There's no need to use 'start' and 'row_count' if you do this; pycassa will chunk the requests automatically. – Tyler Hobbs Nov 22 '11 at 05:29
  • @TylerHobbs Thanks. The count was helpful, but I found a few keys missing. To get just the keys back, I set column_count=0 but could not set filter_empty to False. I get this: `TypeError: get_range() got an unexpected keyword argument 'filter_empty'`. Any clues? – Nish Nov 22 '11 at 23:00
  • @Nish filter_empty was only added in pycassa 1.3. You might have an older version? – jterrace Nov 22 '11 at 23:41
  • In that case you should either upgrade to 1.3 as jterrace suggests, or set column_count to 1 instead of 0. – Tyler Hobbs Nov 24 '11 at 03:48

You can count Cassandra rows without reading all rows.

See the implementation of cassandraCount() in the Spark Cassandra Connector, which does this quite efficiently.

Joseph Lust