
I have a Cassandra schema like this:

    CREATE TABLE wide_data (
        index int,
        id text,
        code text,
        created_at timestamp,
        data text,
        PRIMARY KEY (index, id, code)
    ) WITH CLUSTERING ORDER BY (id ASC, code ASC);

I use this table for a processing job that runs once a week. My data generally has 4-5 million columns but only a couple of distinct index values (my partition key), so I have only a couple of partitions and all of the data goes to only 1-2 nodes.

I generally add this data and then delete it within an hour, once processing is done. When I add new data a week later, my heap usage increases a lot and a node sometimes goes down. I have a couple of questions:

  1. Is this happening because my gc_grace_seconds is 10 days, so that when I fetch data the second time Cassandra is also reading the tombstones of the old rows?
  2. Is there a better way to model my data, given that I don't have a list of id or code values available?
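
For context on question 1: 10 days is 864000 seconds, which is also Cassandra's default gc_grace_seconds. That is longer than the 7-day cycle described above, so tombstones written after one run are not yet eligible for purging when the next run starts. A minimal sketch of how the setting can be changed per table follows; the value 3600 is only an illustrative assumption, and lowering it is only safe if repairs reliably complete within that window.

    -- Illustrative value only: 3600 s = 1 hour (the default is 864000 s = 10 days).
    -- Assumption: repairs finish well inside this window; otherwise deleted data can reappear.
    ALTER TABLE wide_data WITH gc_grace_seconds = 3600;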

I know I should not be using Cassandra as a queue, because a partition larger than 100 MB will start causing heap problems, but I would be happy if I could leverage Cassandra for this purpose.
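
To make question 2 concrete, here is a sketch of one possible remodel, not a confirmed fix. It assumes the writer can assign each row a run identifier and a bucket number; the table name `wide_data_by_run`, the columns `idx`, `run_id` and `bucket`, and the sample values are all hypothetical. Each weekly run then lands in fresh partitions, so reads in the new run do not touch the previous run's tombstones, and a fixed number of buckets can be iterated without knowing any id or code.

    -- Hypothetical remodel: run_id and bucket are added to the partition key.
    CREATE TABLE wide_data_by_run (
        idx int,           -- stands in for the original index column
        run_id text,       -- hypothetical: one value per weekly run, e.g. '2017-W49'
        bucket int,        -- hypothetical: chosen by the writer, e.g. hash(id) % N
        id text,
        code text,
        created_at timestamp,
        data text,
        PRIMARY KEY ((idx, run_id, bucket), id, code)
    ) WITH CLUSTERING ORDER BY (id ASC, code ASC);

    -- Read one bucket at a time; N is fixed, so no list of id or code is needed.
    SELECT id, code, data FROM wide_data_by_run
     WHERE idx = 1 AND run_id = '2017-W49' AND bucket = 0;

    -- After processing, delete the whole partition: one partition tombstone
    -- instead of one tombstone per row.
    DELETE FROM wide_data_by_run
     WHERE idx = 1 AND run_id = '2017-W49' AND bucket = 0;

The bucket count trades partition size against the number of queries per run; it would have to be picked so that each partition stays well under the size at which heap problems begin.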

Heisenberg
  • how often do you do repairs? how long do the repairs take? – Chris Lohfink Dec 06 '17 at 06:13
  • might be useful to look at https://stackoverflow.com/a/37191777/266337 – Chris Lohfink Dec 06 '17 at 06:17
  • I checked and our repairs fail often. This link helped. – Heisenberg Dec 06 '17 at 06:47
  • We had a similar experience with the hints table in 2.2. We fixed it by increasing column_index_size_in_kb in cassandra.yaml. When doing compaction, C* stores all indexes on the heap to help search through the SSTables. We changed this from 64 to 256 kb. – Simon Fontana Oscarsson Dec 06 '17 at 09:07
  • Thanks for your suggestion. But I guess my main problem is tombstones. – Heisenberg Dec 06 '17 at 10:22
  • also might be useful https://stackoverflow.com/questions/36240706/is-it-possible-to-avoid-tombstone-problems-with-cassandra – Mikhail Baksheev Dec 06 '17 at 14:03
  • Generally the point with Cassandra is fast lookups after insertion. So the question you should be asking is `how am I using this data` and from that what the lookup queries would be. Without that info it is impossible to make recommendations. I'd start with removing everything in the schema that is not used on lookup and go from there. – danny Dec 12 '17 at 15:45

0 Answers