We have a 5-node Cassandra cluster in production, with all nodes running Cassandra 2.0.6. The cluster stores user interactions on pages in a column family. The data model looks like this:
Row Key:
20140101:http://example.com/myurlpath?myquery=1
Columns:
Counters
X:Y:Type => Counter Value
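To make the layout above concrete, here is a minimal Python sketch of how such keys could be built and parsed. The function names (`make_row_key`, `make_column_name`, `split_row_key`) are hypothetical and only for illustration, and treating X, Y and Type as plain strings is an assumption; this is not our actual application code.

```python
from datetime import date, datetime


def make_row_key(day: date, url: str) -> str:
    """Build a row key of the form YYYYMMDD:<url> (hypothetical helper)."""
    return f"{day:%Y%m%d}:{url}"


def make_column_name(x: str, y: str, kind: str) -> str:
    """Build a counter column name of the form X:Y:Type (hypothetical helper)."""
    return f"{x}:{y}:{kind}"


def split_row_key(row_key: str):
    """Split a row key back into its date and URL parts.

    Only the first ':' is used as the separator, because the URL
    itself contains ':' characters (for example in 'http://').
    """
    day_part, url = row_key.split(":", 1)
    return datetime.strptime(day_part, "%Y%m%d").date(), url
```

For example, `make_row_key(date(2014, 1, 1), "http://example.com/myurlpath?myquery=1")` gives the row key shown above.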
Since the data is essentially a stream of data points, we have a separate cron job that actively deletes rows [removes all columns] that are more than n weeks old. Although our deletion cron empties the older rows, the row keys still stay in our system [e.g. there is still a row key with timestamp 20130517].
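The age check the cron performs can be sketched roughly as follows. This is a simplified illustration, assuming the age is taken from the date prefix of the row key; `is_expired` and `max_age_weeks` are hypothetical names, and the actual delete call against Cassandra is left out.

```python
from datetime import date, datetime, timedelta


def is_expired(row_key: str, today: date, max_age_weeks: int) -> bool:
    """Return True if the row key's date prefix is older than the cutoff."""
    # The date is everything before the first ':' in the row key.
    day_part = row_key.split(":", 1)[0]
    row_day = datetime.strptime(day_part, "%Y%m%d").date()
    cutoff = today - timedelta(weeks=max_age_weeks)
    return row_day < cutoff
```

With `today = date(2014, 6, 1)` and `max_age_weeks = 4`, a key starting with 20130517 is treated as expired, while one starting with 20140525 is not.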
I checked SO posts here and here, and also the Cassandra forum, but there is no clear resolution in the answers. I understand distributed deletes and tombstones, but this row key issue still remains a mystery to me.
I tried forcing a major compaction and a cleanup, but nothing changed. Because of this, the memory used by our Cassandra cluster is constantly increasing, as our row keys are large [120 B on average].
We have left the gc_grace setting of the column families at the default of 10 days. If that were the issue, we should at least not see row keys older than a year [which are very frequently present]; a month or two at most would be fine.
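The reasoning about gc_grace can be written out as simple date arithmetic. This is only a sketch: the 10-day default corresponds to 864000 seconds, `earliest_purge_time` is a hypothetical helper name, and the deletion time used in the example is made up. It computes when a tombstone becomes eligible for removal, which does not by itself guarantee that a compaction has actually dropped it.

```python
from datetime import datetime, timedelta

# Default gc_grace of 10 days, expressed in seconds.
DEFAULT_GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # 864000


def earliest_purge_time(deleted_at: datetime,
                        gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> datetime:
    """Earliest moment at which a compaction may drop the tombstone."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)
```

For a row deleted on 2013-06-14 (an assumed date), the tombstone would be eligible for removal from 2013-06-24 onward, which is why row keys from 2013 still being present is unexpected.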
How should we manage row key removal in Cassandra?