I want to know exactly how many bytes are stored on disk when I insert a new column into a Cassandra Column Family. My main problem is that I need this information when columns are compressed with Snappy. I know how to calculate the raw bytes but, due to the variability of the data, I cannot properly approximate the compression ratio. Any information about where to find this byte count in the Cassandra codebase would be welcome.

Thanks in advance.

Amanda

1 Answer

Compression can never give guaranteed compression ratios. The best you can get is an average ratio for sample data.

So get a load of sample data, insert it into a test instance, and measure the disk usage.
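As a rough sketch of the "measure the disk usage" step, something like the following Python could total the file sizes under a data directory and compare that with the raw byte count you already know how to compute. The helper names `directory_size_bytes` and `observed_ratio` are made up for illustration, and the directory you point it at depends on your own configuration. Bear in mind that data only shows up in SSTable files after a flush (for example via `nodetool flush`), and that the directory total will likely include more than the compressed column data itself, so treat the result as an approximation.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def observed_ratio(raw_bytes, on_disk_bytes):
    """On-disk size divided by raw size; below 1.0 means the data shrank."""
    if raw_bytes <= 0:
        raise ValueError("raw_bytes must be positive")
    return on_disk_bytes / raw_bytes
```

Taking one measurement before the inserts and one after the flush, then calling `observed_ratio(raw_bytes, after - before)`, gives an average ratio for that particular sample and nothing more general than that.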

You might have data that compresses very poorly with Snappy and actually results in more on-disk usage than storing raw bytes.
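To see this effect without a Cassandra instance, here is a small Python sketch. It uses `zlib` from the standard library purely as a stand-in, since Snappy is not in the standard library; the actual numbers for Snappy will differ, but the general behaviour (repetitive data shrinks, random-looking data grows slightly) is the point being illustrated. The function name `compressed_ratio` is hypothetical.

```python
import os
import zlib


def compressed_ratio(data):
    """Compressed size divided by original size, using zlib as a stand-in."""
    return len(zlib.compress(data)) / len(data)


repetitive = b"A" * 4096         # highly redundant input
random_like = os.urandom(4096)   # effectively incompressible input

print("repetitive: ", compressed_ratio(repetitive))
print("random-like:", compressed_ratio(random_like))
```

The second ratio comes out above 1.0 because the compressor still has to add its own framing overhead even when it cannot find any redundancy to remove.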

When it comes to compression of your data there is one and only one rule: MEASURE

Stephen Connolly
  • Stephen, I've been testing to measure compression and, indeed, certain columns lower the disk usage because of RLE compression. You have confirmed what I thought. I guess the only option is to use statistical measures, because I believe compression only occurs when Cassandra flushes, doesn't it? Thank you again. – Amanda Nov 26 '12 at 11:12