
I created a Cassandra column family and I need to load data into it from a CSV file. The CSV file is about 15 GB.

I am using the CQL 'COPY FROM' command, but it takes a very long time to load the data. What is the best/simplest way to load large amounts of data into Cassandra from CSV files?
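For reference, a basic `COPY FROM` invocation looks like the following; the keyspace, table, column names, and file path are placeholders for illustration:

```shell
# Load a CSV (with a header row) into a Cassandra table via cqlsh.
# mykeyspace.mytable, the column list, and the path are placeholders.
cqlsh -e "COPY mykeyspace.mytable (id, name, value) FROM '/path/to/data.csv' WITH HEADER = TRUE;"
```

This runs through cqlsh on a single client, which is part of why it can be slow on multi-gigabyte files.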

treehouse
Pedro Cunha

2 Answers


The cqlsh built-in COPY TO/FROM for CSV files is simple to use and is intended for small to moderately sized data sets. You didn't mention which Cassandra version you're using, but there were a lot of performance improvements made in 2.1.5 (CASSANDRA-8225).

An alternative tool that has had good results for larger data is cassandra-loader. You could try that with a subset of your file (like 1000 rows) to confirm it works, then try with your whole file to see the performance.
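A quick way to carve out such a subset, assuming the CSV's first line is a header (the file here is synthetic, standing in for the real 15 GB file):

```shell
# Generate a synthetic CSV (header + 5000 data rows) standing in for the real file.
printf 'id,name,value\n' > data.csv
for i in $(seq 1 5000); do printf '%d,row%d,%d\n' "$i" "$i" $((i * 2)); done >> data.csv

# Take the header plus the first 1000 data rows as a test subset.
head -n 1001 data.csv > sample.csv   # 1 header line + 1000 data rows
```

You could then point cassandra-loader at `sample.csv`, roughly `cassandra-loader -f sample.csv -host <node> -schema "ks.table(col1,col2,col3)"` (flags per the cassandra-loader README; the host and schema are placeholders), before committing to the full file.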

BrianC
  • I am using Cassandra 2.2.3. Thanks BrianC, I will test loading the data with cassandra-loader and check the performance... – Pedro Cunha Oct 28 '15 at 17:33

Use sstableloader. Check out this blog post. You need to parse your CSV file into sstables with the same C* schema and bulk load them into C*.
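In outline, once the sstables have been generated (e.g. with the CQLSSTableWriter Java API), streaming them in looks like this; the node address and paths are placeholders:

```shell
# Stream pre-built sstables into a running cluster.
# The directory layout must end in <keyspace>/<table>/ and contain the
# generated sstable files; the address and path are placeholders.
sstableloader -d 127.0.0.1 /path/to/sstables/mykeyspace/mytable
```

The two-step workflow (write sstables offline, then stream them) is what makes this approach fast: the cluster receives finished data files rather than individual writes.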

treehouse
  • sstableloader is the right answer for total raw speed, but may be overkill for a 15 GB file. sstableloader uses the bulk load interface, so you generate sstables in advance and stream them into the system as data files, not as individual mutations. This is MUCH faster, but does require that you make the sstables in advance. – Jeff Jirsa Oct 29 '15 at 06:19