
I'm implementing an app that generates hundreds of thousands of rows in 4 threads. Each thread opens a separate connection to Cassandra.

Every item in the table has a unique hash identifier (a String), but the primary key is a UUID.

The process for persisting an item is as follows:

1) The item is created and its hash is computed.
2) A lookup for the hash is executed against a second table, which maps hashes to the items' UUIDs.
3) If a "hash - uuid" pair is found, a lookup for the item's UUID is executed (against the first table again). Since the item has to exist (because a "hash - uuid" pair was found), the item is loaded from Cassandra into JPA and then updated. If no "hash - uuid" pair is found, a new item is created in the corresponding table and a new "hash - uuid" pair is saved as well.
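The three steps above can be sketched roughly as follows. This is only an illustration, not the code from the question: the two in-memory maps stand in for the two Cassandra tables, SHA-256 is an assumed hash function (the question does not say which one is used), and the names `HashUuidUpsertSketch`, `persist` and `sha256Hex` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class HashUuidUpsertSketch {
    // Stand-in for the items table: uuid (primary key) -> JSON payload.
    static final Map<UUID, String> itemsByUuid = new ConcurrentHashMap<>();
    // Stand-in for the second table: hash -> uuid.
    static final Map<String, UUID> uuidByHash = new ConcurrentHashMap<>();

    // Assumption: SHA-256 over some natural key; the real hash is not specified.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Note: this lookup-then-write sequence is not atomic across threads.
    static UUID persist(String naturalKey, String json) {
        String hash = sha256Hex(naturalKey);              // step 1: compute the hash
        UUID existing = uuidByHash.get(hash);             // step 2: look up hash -> uuid
        if (existing != null) {                           // step 3a: pair found, update
            if (!itemsByUuid.containsKey(existing)) {
                throw new IllegalStateException("pair exists but item is missing: " + existing);
            }
            itemsByUuid.put(existing, json);
            return existing;
        }
        UUID id = UUID.randomUUID();                      // step 3b: no pair, create both rows
        itemsByUuid.put(id, json);
        uuidByHash.put(hash, id);
        return id;
    }

    public static void main(String[] args) {
        UUID first = persist("item-a", "{\"v\":1}");   // first run: insert
        UUID second = persist("item-a", "{\"v\":2}");  // second run: update of the same row
        System.out.println(first.equals(second));
    }
}
```

In the first generation run only the insert branch is taken; the update branch, and with it the read by primary key, is exercised only in the second run.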

The data generation has two steps. The first step runs against empty tables and generates the first datasets. No errors happen there, because in step 3 a "hash - uuid" pair is never found, so no updates occur.

In the second step, the whole algorithm runs again, but on already populated tables. In this step, random errors occur while reading items by their corresponding UUIDs (primary keys): sometimes the server doesn't return the complete text data (proper JSON strings are stored in the table, but incomplete JSON strings are retrieved into the application).

I'm completely sure that my algorithm is correct, because the same algorithm worked with Hibernate and MySQL, and even with PostgreSQL (but since I need faster writes, I'm experimenting with Cassandra).

I am using a MacBook Pro with 16 GB RAM. To work with Cassandra I use the Kundera library (which supports JPA). As for Cassandra, I have tried the DataStax 2.0.4 version and also version 2.0.7 downloaded directly from the Apache site. There is no cluster; only one instance is running locally on my machine, on an external SSD drive. Kundera is using CQL v3.

Does anybody have an idea how this behaviour could occur? Is there a bug in the DataStax Cassandra driver or in Kundera? Or am I using Cassandra incorrectly, and the database shouldn't be used this way? Or are there any configuration tweaks I might have forgotten?

The only things I have changed in the Cassandra configuration file are the timeouts, because I was getting too many TimeoutExceptions with the default values (the timeouts occurred during primary key lookups).

rastusik
  • If anybody is interested, here is a more detailed description of the problem: https://github.com/impetus-opensource/Kundera/issues/587 – rastusik May 08 '14 at 09:59

1 Answer


I suspect your code is not using the Cassandra connections in a thread-safe manner: care must be taken to allow only one thread to access a connection at a time. I do not know how Kundera approaches this, because JPA will generate incredibly inefficient queries for Cassandra and I do not recommend it. See the data modeling resources here, and use the native CQL Java driver.
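As a rough illustration of the "one thread per connection at a time" rule, here is a minimal sketch that confines each connection to the thread that created it by using a `ThreadLocal`. It does not use Kundera or the Cassandra driver: `FakeConnection` is a hypothetical stand-in for whatever non-thread-safe handle the application holds (for example a JPA `EntityManager`), and `runWorkers` is an invented helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class PerThreadConnectionSketch {
    // Hypothetical stand-in for a connection object that is not thread-safe.
    static final class FakeConnection {
        private static final AtomicInteger COUNTER = new AtomicInteger();
        final int id = COUNTER.incrementAndGet();

        String read(String key) {
            return "conn-" + id + ":" + key;
        }
    }

    // Each worker thread lazily creates its own connection and never shares it.
    static final ThreadLocal<FakeConnection> CONNECTION =
            ThreadLocal.withInitial(FakeConnection::new);

    // Runs the tasks on a fixed pool and returns the ids of the connections used.
    static Set<Integer> runWorkers(int threads, int tasksPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<Integer> usedConnectionIds = ConcurrentHashMap.newKeySet();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < threads * tasksPerThread; i++) {
                futures.add(pool.submit(() -> {
                    FakeConnection conn = CONNECTION.get(); // this thread's own connection
                    conn.read("some-uuid");
                    usedConnectionIds.add(conn.id);
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface task failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return usedConnectionIds;
    }

    public static void main(String[] args) {
        // With 4 worker threads, at most 4 distinct connections are ever created.
        System.out.println(runWorkers(4, 10).size() <= 4);
    }
}
```

Guarding a single shared connection with a lock would satisfy the same rule, at the cost of serializing all reads and writes.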

jbellis
  • Thanks for the answer, that's exactly what I suspect. There is no problem with query generation, since even with JPA the queries are only basic primary key lookups. I suspect that the problem lies in Kundera; I was just asking to be sure that I'm not doing any data manipulation that Cassandra is not meant to handle in the way I'm doing it. – rastusik May 08 '14 at 09:58