
I have a huge Cassandra table with about 13 million entries. This table serves as a kind of lookup table, which means there are no writes, only reads. I use DataStax Enterprise 4.8 (including Cassandra 2.1).

So the content is very static, but from time to time (every few months) the content is updated. The problem is that old data becomes outdated while new data appears, but the old data is not overwritten (it stays in the table). It is necessary to remove the old data to keep the database clean.

I have one requirement: the database must remain available during the update. It is okay to have a short period (a few minutes) where the old and the new data exist side by side.

I already thought about the following solutions:

  • Write the new table directly as an SSTable and exchange it with the old one
  • Do the update as a batch, with a truncate of the old data at the beginning
  • Create a new table (with a new name) and change the table used in the program (while it is running)
  • Add a version column, insert the new data with a new version, and delete the old data (with the old version) afterwards

Which of these solutions is the best one? Or, even better, is there a solution that solves my problem more elegantly?

Josef Bauer
  • Do you mean the data must be available? Because truncating and then inserting will work without your database going down. – Dennis Hunziker Feb 16 '16 at 10:28
  • Sorry, forgot to mention. Yes, the data must be available (either the old or the new one). The problem is that if I truncate the table, it takes about 30 minutes until the new data is inserted. This is more or less a 30-minute downtime. – Josef Bauer Feb 16 '16 at 10:39
  • Would it be acceptable to create a new table using a version number in the name and write the new data to that table and then switch your applications/users to using that table? – mikea Feb 16 '16 at 12:22
  • Is the update a brand new version of the table? – phact Feb 16 '16 at 13:20
  • @mikea this is more or less my third solution. I also have a clustered application (in addition to the clustered Cassandra), so I would also have to implement a mechanism to inform all my application instances that the table changed. That is what I don't like about this idea. – Josef Bauer Feb 16 '16 at 13:45
  • @phact yes, the update is a brand new version of the table. All the information I need is contained in this update. – Josef Bauer Feb 16 '16 at 13:46
  • I would go for the version column. Add that as the last clustering key and create a secondary index for it. – Ralf Feb 17 '16 at 12:14
  • @Ralf this is also my preferred solution. As soon as I have some findings, I will update my question. – Josef Bauer Feb 17 '16 at 18:25

1 Answer


Okay, after a lot of testing, here are my findings. All the measurements mentioned are based on 13 million records.

Writing my own SSTable

I have written a small Java tool that creates SSTables. Here you can find a good example of how to do this with the CQLSSTableWriter. After creating the SSTable, I used the sstableloader command line tool (which ships with Cassandra) to import it into Cassandra.
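
A minimal sketch of such a tool, assuming a simple key/value schema (the keyspace, table, and output directory are made up for illustration; adapt them to your own data model):

```java
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class SSTableCreator {
    public static void main(String[] args) throws Exception {
        // Hypothetical lookup table; adjust schema and insert statement
        // to match your own model.
        String schema = "CREATE TABLE mykeyspace.lookup ("
                + "key text PRIMARY KEY, value text)";
        String insert = "INSERT INTO mykeyspace.lookup (key, value) VALUES (?, ?)";

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/mykeyspace/lookup")      // must already exist
                .forTable(schema)
                .using(insert)
                .withPartitioner(new Murmur3Partitioner())  // must match the cluster
                .build();

        writer.addRow("some-key", "some-value");            // one call per record
        writer.close();

        // Then stream the result into the cluster, e.g.:
        //   sstableloader -d <cassandra-host> /tmp/mykeyspace/lookup
    }
}
```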

Conclusion

  • the creation of the SSTable is pretty quick (~10 minutes)
  • the import of the SSTable is very slow (~6 hours)
  • you have to make sure that you use exactly the same Java library version (cassandra-all.jar) as your Cassandra version; otherwise the created SSTable can be incompatible with Cassandra

Import with CQL and a version column

I have written a small Java tool that executes CQL commands to insert the records into Cassandra. Additionally, I added a version column, so after the import I can remove the old data. The downside is that the version itself has to be my only partition key so that I can remove old records easily, which means the data is not spread across the cluster by a proper key. To work around this, I indexed the table with Solr and use Solr queries to search that table. The fact that the data is not distributed between single nodes is okay for us; the search still works like a charm, and at least the data is replicated between several nodes.
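
A minimal sketch of that data model (keyspace, table, and column names are invented for illustration); the point is that a whole generation of data can be dropped with a single partition-level delete:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class VersionedLookup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // The version is the only partition key, so each version forms
        // one (large) partition and old versions are trivial to delete.
        session.execute("CREATE TABLE IF NOT EXISTS mykeyspace.lookup ("
                + "version int, key text, value text, "
                + "PRIMARY KEY (version, key))");

        // Import the new generation under a new version number ...
        session.execute("INSERT INTO mykeyspace.lookup (version, key, value) "
                + "VALUES (2, 'some-key', 'some-value')");

        // ... and remove the old generation after the import is complete.
        session.execute("DELETE FROM mykeyspace.lookup WHERE version = 1");

        cluster.close();
    }
}
```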

Conclusion

  • the duration of the import is okay (~1.5 hours)
  • the load on the Cassandra nodes goes up heavily; I still have to investigate how this influences the experience of the "normal" users (but a quick check shows that it is still fine)

Result

I will use the second solution because it is faster and you don't have to take care of matching library versions. All my tools use threading, so here I also have a knob to turn to find the best balance between concurrency and threading overhead. In the end, I use a low number of threads in my tool (~8) combined with the executeAsync method of the DataStax Java driver.
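
A minimal sketch of that pattern with the DataStax Java driver (driver 2.x API; the table and the in-flight limit of 256 are illustrative assumptions, not the values from my tool):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

import java.util.concurrent.Semaphore;

public class AsyncImporter {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        PreparedStatement insert = session.prepare(
                "INSERT INTO mykeyspace.lookup (version, key, value) VALUES (?, ?, ?)");

        // Cap the number of in-flight async requests so the cluster is
        // not flooded; 256 is an arbitrary illustrative limit.
        final Semaphore inFlight = new Semaphore(256);

        for (int i = 0; i < 1_000_000; i++) {   // stand-in for reading real records
            inFlight.acquire();
            ResultSetFuture future = session.executeAsync(
                    insert.bind(2, "key-" + i, "value-" + i));
            future.addListener(inFlight::release, Runnable::run);
        }

        inFlight.acquire(256);                  // wait for the last requests to finish
        cluster.close();
    }
}
```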

Josef Bauer