0

I have Cassandra cluster with three nodes. I want to perform clean up task on particular table which consist around 1 TB of data in each node.

Table Structure

Table consist multiple rows of subscriber with date_created(timestamp), I want to clear all rows of subscriber for which latest entry(date_created) is older than 7 Days.

For example - in sample attached above data we have to delete all rows subscriber 2 whereas all rows of subscriber 2 will be preserved if runs of 2019-12-10.

We have around 10 M of subscriber, In order to get total number of records select count(*) query is throwing below exception

ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

What approach I should use to read Cassandra, since this cant be perform in memory as huge data of 3 TB.

TheCurious
  • 593
  • 1
  • 4
  • 29

1 Answers1

0

Your count query is performing a full table scan. This is why it times out. You need to delete your data using your partition/ clustering keys.

Alex Tbk
  • 2,042
  • 2
  • 20
  • 38
  • We havn't any other source to get identifier to whom we have to drop, first we have to read all data, and delete it followed by manipulation.What should be our criteria to read cassendra. – TheCurious Oct 09 '19 at 09:45
  • Then your table structure is wrong, this is why you are running into this issue. – Alex Tbk Oct 09 '19 at 09:48
  • updating table structure into question. – TheCurious Oct 09 '19 at 09:59
  • You could add a table where you store the known profile_ID of a subscriber. Then iterate: select from table where subscriber_id =a and profile_id=b and timestamp – Alex Tbk Oct 09 '19 at 10:29
  • This could cause us 10M select queries .Is there any way to mitigate it. – TheCurious Oct 09 '19 at 10:31
  • You didnt tell, which other queries are used with this table. You could change the key to (subscriber_id, timestamp, profile_id) then its: select from table where subscriber_id =a and timestamp – Alex Tbk Oct 09 '19 at 10:36
  • we have indexing like this- PRIMARY KEY ((subscriber_id, profile_id), timestamp).....(subscriber_id, profile_id) belongs to same partition. My doubt/query is - how can we minimize 10M select query since we have 10M subscribers. – TheCurious Oct 09 '19 at 11:09
  • Having 10M queries should be not a problem, small queries perform well. – Alex Tbk Oct 09 '19 at 11:16