I have a Cassandra schema like this:
```
CREATE TABLE wide_data (
    index int,
    id text,
    code text,
    created_at timestamp,
    data text,
    PRIMARY KEY (index, id, code)
) WITH CLUSTERING ORDER BY (id ASC, code ASC);
```
I use this table for a processing job that runs once a week. The data is typically 4-5 million rows spread across only a couple of `index` values (my partition key), which means I have only a couple of partitions and the data lands on just 1-2 nodes.

I load the data and then delete it within an hour of processing. When I load the next batch a week later, heap usage spikes and a node sometimes goes down. I have a couple of questions:
- Is this happening because my `gc_grace_seconds` is 10 days, so when I fetch the new data Cassandra is also scanning the tombstones left over from last week's deletes?
- Is there a better way to model my data, given that I don't have a list of `id` or `code` values in advance?
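One way I could try to confirm the tombstone theory is to trace a read in cqlsh (the `index = 1` value here is just an example); the trace reports how many tombstone cells the read touched:

```
TRACING ON;
SELECT id, code FROM wide_data WHERE index = 1 LIMIT 100;
-- The trace output includes lines like
-- "Read 100 live rows and N tombstone cells";
-- a large N would mean reads are wading through last week's deletes.
TRACING OFF;
```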
I know I should not be using Cassandra as a queue, and that a partition larger than about 100 MB will start causing heap problems, but I would be happy if I could still leverage Cassandra for this purpose.
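To frame the second question: one pattern I've seen suggested (and am unsure about) is to write each weekly batch into its own table and then `TRUNCATE` or `DROP` it after processing, since those drop the SSTables outright instead of writing per-row tombstones. A hypothetical sketch, with the table name suffix being the run date:

```
-- Hypothetical per-run table, same layout as wide_data.
CREATE TABLE wide_data_2024_01_15 (
    index int,
    id text,
    code text,
    created_at timestamp,
    data text,
    PRIMARY KEY (index, id, code)
) WITH CLUSTERING ORDER BY (id ASC, code ASC);

-- After processing, instead of DELETE (which writes tombstones):
TRUNCATE wide_data_2024_01_15;
-- or: DROP TABLE wide_data_2024_01_15;
```

Would this avoid the tombstone problem, or does it just trade it for schema-churn problems?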