
I am generating Cassandra SSTables using the bulk loading sample provided on the DataStax website: http://www.datastax.com/dev/blog/bulk-loading

My question is: how much disk space should the SSTable files ideally consume? In my case the source CSV file is 40 GB, and the total disk space consumed by the SSTables generated from it is around 250 GB. Is there something I am missing while creating these tables? Are there any compression options available when generating SSTables?
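For reference, here is a minimal sketch of my generation step, modeled on the blog post (the keyspace, column family, column names, and row data below are placeholders, not my real schema; the CSV parsing is elided):

```java
import java.io.File;
import java.nio.ByteBuffer;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.dht.RandomPartitioner;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class SSTableGenerator {
    public static void main(String[] args) throws Exception {
        // sstableloader expects the output directory layout <Keyspace>/<ColumnFamily>
        File directory = new File("Demo/Users");
        directory.mkdirs();

        // Note: only the keyspace and column family names are passed in;
        // none of these arguments control compression.
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                directory,
                new RandomPartitioner(),
                "Demo",             // keyspace (placeholder)
                "Users",            // column family (placeholder)
                AsciiType.instance, // comparator for column names
                null,               // no subcomparator (not a super column family)
                64);                // buffer size in MB before flushing an SSTable

        long timestamp = System.currentTimeMillis() * 1000;

        // In the real code this loop reads the 40 GB CSV line by line;
        // here one hard-coded row stands in for a parsed CSV line.
        ByteBuffer rowKey = bytes("user1");
        writer.newRow(rowKey);
        writer.addColumn(bytes("firstname"), bytes("John"), timestamp);
        writer.addColumn(bytes("lastname"), bytes("Smith"), timestamp);

        writer.close();
    }
}
```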

The second step, where I load the SSTables using sstableloader, works perfectly fine, and the data is available for querying in CQL.

Also, I would like to know if there are any other techniques available for importing large amounts of data into Cassandra besides the bulk load method mentioned above.

amey

1 Answer


First of all, check whether compression is enabled or not. How do you check that?

If the SSTable is compressed, it will have a CompressionInfo.db component (i.e. one of the files composing the SSTable will end with -CompressionInfo.db). If there is no such file, then it's not compressed.
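For instance, a quick programmatic check (the data directory path below is just an illustration; yours depends on the data_file_directories setting in cassandra.yaml):

```java
import java.io.File;

public class CompressionCheck {
    public static void main(String[] args) {
        // Hypothetical data directory for keyspace "Demo", column family "Users"
        File dir = new File("/var/lib/cassandra/data/Demo/Users");
        File[] components = dir.listFiles();
        boolean compressed = false;
        if (components != null) {
            for (File f : components) {
                // Compressed SSTables have a *-CompressionInfo.db component
                if (f.getName().endsWith("-CompressionInfo.db")) {
                    System.out.println("Found: " + f.getName());
                    compressed = true;
                }
            }
        }
        System.out.println(compressed
                ? "SSTables are compressed"
                : "No CompressionInfo.db component found; not compressed");
    }
}
```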

For further compression-related information, check this.
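If compression turns out to be disabled, note that it is set on the column family itself rather than in the SSTable-writing code. A minimal sketch, assuming CQL3 on a 1.2-era cluster (Demo.Users is a placeholder name):

```sql
-- SnappyCompressor is one of the built-in compressors
ALTER TABLE Demo.Users
  WITH compression = { 'sstable_compression': 'SnappyCompressor' };
```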

Moving to your last question: there is one other alternative to the bulk load method, the COPY command. See the documentation.
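A minimal sketch of what that looks like in cqlsh (the table, columns, and CSV path are placeholders; the QUOTE option is what handles quoted fields like "aa","bb"):

```sql
-- cqlsh COPY; the options shown are assumptions about your CSV layout
COPY Demo.Users (id, firstname, lastname)
  FROM '/tmp/users.csv'
  WITH DELIMITER = ',' AND QUOTE = '"';
```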

abhi
  • Thanks Abhi. I understand the compression parameter that can be specified during creation of a column family, but would this affect the way I am generating SSTables using the Cassandra IO API? The only arguments required for generating SSTables are the keyspace and column family name. Also, I have tried the COPY command previously, but it requires the CSV to have a quoted structure, something like "aa","bb" ... Can you share what techniques you are using to load data into Cassandra? – amey May 15 '13 at 19:05
  • Yes, it will definitely affect it. Try that. And no, there is no other mechanism to load except these two. – abhi May 15 '13 at 19:16
  • @amey I don't use bulk loading jobs. I prefer doing it manually, using threads, as I have to maintain lots of counter columns (see the sketch after this thread). – abhi May 17 '13 at 16:51
  • So do you use Hector/Astyanax to do bulk inserts? How is the performance with that, in the sense of how long it takes for, say, 10 GB of data? – amey May 17 '13 at 18:01
  • It depends on the system spec; in my case I can write up to 2000 records/core/second, as my records are quite complex. From that you can estimate the time to load 10 GB of data. – abhi May 17 '13 at 18:14
  • Thanks for the insights, Abhi. Are you using any Cassandra client like Hector to write records? – amey May 17 '13 at 20:44
  • I prefer Astyanax. Recently I have been thinking of migrating to the DataStax Java driver after seeing its simple API. – abhi May 18 '13 at 04:56
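For readers of this thread, a minimal sketch of the manual-insert approach mentioned above, using Astyanax over Thrift (cluster name, seeds, keyspace, and column family are all placeholders, and error handling is omitted):

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class ManualInsertSketch {
    public static void main(String[] args) throws Exception {
        // Build a Thrift-based context; all names here are placeholders
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")
                .forKeyspace("Demo")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
                .withConnectionPoolConfiguration(
                        new ConnectionPoolConfigurationImpl("myPool")
                                .setPort(9160)
                                .setMaxConnsPerHost(4)
                                .setSeeds("127.0.0.1:9160"))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getEntity();

        // Placeholder column family with string row keys and string column names
        ColumnFamily<String, String> CF_USERS = new ColumnFamily<String, String>(
                "Users", StringSerializer.get(), StringSerializer.get());

        // Batch several columns into one mutation; in a real loader you would
        // run many of these batches concurrently across threads
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_USERS, "user1")
                .putColumn("firstname", "John", null)
                .putColumn("lastname", "Smith", null);
        m.execute();

        context.shutdown();
    }
}
```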