Observed data loss while fetching records using Cassandra JDBC driver

Question

Table Structure : testkeyspace.productInfo (productID, productName, timestamp, price) - Partition key column in the above table is “productId”. - table -> testkeyspace.productInfo consists of 10k records

Test Case

Test consists of two Java applications
1. Application 1 -> Writes data to Cassandra table (testkeyspace.productInfo)
2. Application 2 -> Reads from Cassandra table (testkeyspace.productInfo) continuously with 5 seconds interval
Partition Key (productID) is unique for all records. Hence, Number of partitions = Number of records in table

How data is read from Cassandra table? (Read Pattern)

Initially, "SELECT *" is issued to Cassandra table with the help of Cassandra JDBC
Latest Timestamp (lastNotedTimestamp) of the fetched records is noted from the timestamp column of the table
The subsequent "SELECT *" are issued with a WHERE condition of timestamp > lastNotedTimestamp
Step 2 and 3 are repeated until application kill.

Issue

While Step 4 is executed in parallel with the write application, few random records are missing in the ResultSet returned from the JDBC call.
While Step 4 is executed after write application has completed its task, records are not missed and the fetching is successful.

Note: Above mentioned issue is applicable even when using a numeric-based column in the WHERE clause of the select query.

It's tested and observed in the single node cluster.

Can someone explain this behaviour and let me know why this is happening?

Thanks in advance.

Please consider providing additional and complete details because it's missing crucial information that is required to perform basic triage. Update it with (1) keyspace definition [Run `DESCRIBE KEYSPACE testkeyspace;`], (2) full nodetool `SELECT` commands you're running, (3) any code snippet(s), (4) full error message, (5) full stack trace, and (6) Cassandra & JDBC version. Cheers! – As it's currently written, it's hard to tell exactly what you're asking. — Madhavan, May 17 '23 at 11:37

Aaron · Answer 1 · 2023-05-18T14:18:41.463

While Step 4 is executed in parallel with the write application, few random records are missing in the ResultSet returned from the JDBC call.

Can someone explain this behaviour and let me know why this is happening?

Sure. The problem is with the queries:

Initially, "SELECT *" is issued to Cassandra table

No mention of a WHERE clause, so I'll assume that there isn't one.

The subsequent "SELECT *" are issued with a WHERE condition of timestamp > lastNotedTimestamp

So this one does have a WHERE clause. However, I don't see the partition key inside of it.

When you query Cassandra without specifying a partition key, there are no guarantees that the query can be served by a single node. This means (for both queries) that one node is being selected as a coordinator, to query the other nodes exhaustively. Once that node has assembled the result set (in RAM) it then returns it to the application.

Non-partition key queries cause Cassandra nodes to work very hard. Now, add into the mix that the cluster is also being asked it to handle the throughput of the write application, and I'm not surprised that you're seeing issues.

Essentially, this is happening because the current read/write patterns are too resource-intensive for your cluster to handle.

You can:

Design a specific table to suit your timestamp query pattern.
Use a multi-threaded approach to writing, throttling down the write throughput by limiting the number of concurrent write threads in the write application.
Never use SELECT * (without a WHERE clause) or the ALLOW FILTERING directive.

Edit 20230518

i tried with the partition key column in the where clause like "SELECT * FROM keyspace.table where partitionkeyColumn > latestPartitionColumnObservedPosition ALLOW FILTERING". I observed few record missing randomly for this case also.

Yes, because that query is still causing stress on all of the nodes. The partition key needs to be specified with an equality operator. Sure, it's getting a partition key. But all of the partition keys greater than latestPartitionColumnObservedPosition will still be spread across multiple nodes, so nothing is improving.

Given the table definition provided above, this table can support one performant query:

SELECT * FROM productInfo WHERE productId=?;

That's it. If there's any other query running on that table, it's going to be hitting multiple nodes and causing them to work hard.

If the concern is about the timestamp, then you might try building a new table with a "bucketing" approach, like this:

CREATE TABLE productinfo_by_day_bucket (
    productid TEXT,
    daybucket INT,
    productname TEXT,
    producttimestamp TIMESTAMP,
    price DECIMAL,
    PRIMARY KEY (daybucket, producttimestamp, productid)
WITH CLUSTERING ORDER BY (producttimestamp DESC, productid ASC);

This will store products updated for a particular day in a single partition, meaning that they can be queried like:

SELECT * FROM productinfo_by_day_bucket
WHERE daybucket=20230518
AND producttimestamp > '2023-05-18 14:10:05.705000+0000';

Depending on how many products are updated in a particular day, that "time bucket" may need to be a different unit of measure. But that's up to the business requirements. This also allows filtering on timestamp, as long as it's within a partition.

Note that producttimestamp is specified as a clustering key, sorting in descending order. This is because most time-based application queries tend to care about the most-recent data. Sorting in descending order allows queries to quickly pull records from the "top" of the partition, which will be faster.

productid is the last clustering key, but that's really just there as a tie-breaker to ensure uniqueness.

You should check out DataStax Academy, especially DS220 which is our free class on Cassandra data modeling. It really does a good job of explaining how everything works under-the-hood, and gives you data modeling and query building strategies to leverage that.

Thank you for the response. Anyways, i tried with the partition key column in the where clause like "SELECT * FROM keyspace.table where partitionkeyColumn > latestPartitionColumnObservedPosition ALLOW FILTERING". I observed few record missing randomly for this case also. — Mohamed Kamarudeen, May 18 '23 at 05:40

score 0 · Answer 2 · answered May 19 '23 at 07:59

What you see as "data loss" is most likely inconsistency between replicas.

What is probably happening is that the nodes cannot keep up with writes and dropping mutations leading to inconsistent replicas.

You need to run repairs regularly if you aren't already. But the long term solution is to add more nodes to your cluster so the nodes can keep up with the load.

In addition, I agree with Aaron that you need to redesign your data model because the table is not a good fit for your app query.

As a side note, it doesn't make sense to use the JDBC driver when you have a Java application. You should switch to using the Cassandra Java driver so you can natively query the database with CQL.

If you want to learn how, we have a lot of free tutorials and courses at datastax.com/dev. Cheers!

Observed data loss while fetching records using Cassandra JDBC driver

2 Answers2