
We need to load several gigabytes of CSV files into Cassandra. We tried ingesting the data with the SOURCE command, pointing it at text files that contain INSERT statements built from the CSV values.
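The files look roughly like this (keyspace, table and column names are simplified here for illustration):

    -- inserts.cql, executed from cqlsh with: SOURCE 'inserts.cql'
    INSERT INTO my_ks.events (device_id, reading) VALUES ('sensor-1', 21.5);
    INSERT INTO my_ks.events (device_id, reading) VALUES ('sensor-2', 19.8);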

With this approach, the data is not getting uploaded correctly: the data from the first row is repeated in all subsequent rows. (I have checked the INSERT statements and they appear to contain the right values.)

What could be the issue? Am I seeing duplicate rows because it takes time for Cassandra to flush the data to disk? (nodetool shows no pending flushes, though.)

Would it be more efficient to create CSV files and use the COPY command to ingest the data? Please advise.


1 Answer


COPY is usually used for smaller amounts of data. The recommended approach for bulk loads is to create SSTable files from your data and load them with sstableloader. This is a bit more work, but it should result in much quicker ingestion. You could also use Spark and ingest into Cassandra from there.
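For comparison, a COPY-based load from cqlsh looks roughly like this (keyspace, table and file names below are placeholders):

    -- fine for modest volumes; for multi-gigabyte loads prefer sstableloader
    COPY my_ks.events (device_id, inserted_at, reading)
      FROM '/path/to/events.csv' WITH HEADER = true;

The SSTable route means generating SSTables offline (e.g. with CQLSSTableWriter) and then streaming them into the cluster with the sstableloader command-line tool.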

As for the inconsistencies: Cassandra performs upserts based on the primary key. If more than one row has the same primary key, the last write wins, so earlier rows are silently overwritten. If you need to keep all rows, add a timestamp or timeuuid column to the primary key to make each record unique.
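A minimal sketch of that, assuming a hypothetical my_ks.events table:

    -- Without a uniquifying column, rows sharing a primary key overwrite
    -- each other (upsert semantics). A timeuuid clustering column keeps
    -- every insert as a separate row.
    CREATE TABLE my_ks.events (
        device_id   text,
        inserted_at timeuuid,
        reading     double,
        PRIMARY KEY (device_id, inserted_at)
    );

    INSERT INTO my_ks.events (device_id, inserted_at, reading)
    VALUES ('sensor-1', now(), 21.5);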
