While Step 4 is executed in parallel with the write application, few random records are missing in the ResultSet returned from the JDBC call.
Can someone explain this behaviour and let me know why this is happening?
Sure. The problem is with the queries:
Initially, "SELECT *" is issued to Cassandra table
No mention of a WHERE
clause, so I'll assume that there isn't one.
The subsequent "SELECT *" are issued with a WHERE condition of timestamp > lastNotedTimestamp
So this one does have a WHERE
clause. However, I don't see the partition key inside of it.
When you query Cassandra without specifying a partition key, there are no guarantees that the query can be served by a single node. This means (for both queries) that one node is being selected as a coordinator, to query the other nodes exhaustively. Once that node has assembled the result set (in RAM) it then returns it to the application.
Non-partition key queries cause Cassandra nodes to work very hard. Now, add into the mix that the cluster is also being asked it to handle the throughput of the write application, and I'm not surprised that you're seeing issues.
Essentially, this is happening because the current read/write patterns are too resource-intensive for your cluster to handle.
You can:
- Design a specific table to suit your timestamp query pattern.
- Use a multi-threaded approach to writing, throttling down the write throughput by limiting the number of concurrent write threads in the write application.
- Never use
SELECT *
(without a WHERE
clause) or the ALLOW FILTERING
directive.
Edit 20230518
i tried with the partition key column in the where clause like "SELECT * FROM keyspace.table where partitionkeyColumn > latestPartitionColumnObservedPosition ALLOW FILTERING". I observed few record missing randomly for this case also.
Yes, because that query is still causing stress on all of the nodes. The partition key needs to be specified with an equality operator. Sure, it's getting a partition key. But all of the partition keys greater than latestPartitionColumnObservedPosition
will still be spread across multiple nodes, so nothing is improving.
Given the table definition provided above, this table can support one performant query:
SELECT * FROM productInfo WHERE productId=?;
That's it. If there's any other query running on that table, it's going to be hitting multiple nodes and causing them to work hard.
If the concern is about the timestamp
, then you might try building a new table with a "bucketing" approach, like this:
CREATE TABLE productinfo_by_day_bucket (
productid TEXT,
daybucket INT,
productname TEXT,
producttimestamp TIMESTAMP,
price DECIMAL,
PRIMARY KEY (daybucket, producttimestamp, productid)
WITH CLUSTERING ORDER BY (producttimestamp DESC, productid ASC);
This will store products updated for a particular day in a single partition, meaning that they can be queried like:
SELECT * FROM productinfo_by_day_bucket
WHERE daybucket=20230518
AND producttimestamp > '2023-05-18 14:10:05.705000+0000';
Depending on how many products are updated in a particular day, that "time bucket" may need to be a different unit of measure. But that's up to the business requirements. This also allows filtering on timestamp, as long as it's within a partition.
Note that producttimestamp
is specified as a clustering key, sorting in descending order. This is because most time-based application queries tend to care about the most-recent data.
Sorting in descending order allows queries to quickly pull records from the "top" of the partition, which will be faster.
productid
is the last clustering key, but that's really just there as a tie-breaker to ensure uniqueness.
You should check out DataStax Academy, especially DS220 which is our free class on Cassandra data modeling. It really does a good job of explaining how everything works under-the-hood, and gives you data modeling and query building strategies to leverage that.