
We want to query our QLDB Ledger using the SUM() function (with a WHERE clause referring only to indexed fields). e.g.,

select count(*), sum(NonIndexedField) from myTable where IndexedField1 = "foo" and IndexedField2 = "bar";

Is this a good or bad query pattern? The guidance at this page talks about how the COUNT() function is not optimized, and that makes me suspicious about the SUM() method as well.

Animesh
  • Please [edit] your question to show your WHERE clause. QLDB is inefficient when handling inequality filters such as `WHERE col BETWEEN a AND b` or `WHERE col LIKE 'abc%'`. – O. Jones Jun 25 '21 at 13:48
  • sure; sample query added. Only aiming to run `=` queries on indexed fields. – Animesh Jun 30 '21 at 01:39

2 Answers


Animesh,

Overall, your query follows a good pattern because it depends primarily on an equality comparison against an index (IndexedField1). However, if there are many thousands of documents with IndexedField1 = "foo", your query may run slowly or even time out. If fewer documents match IndexedField2 = "bar" than IndexedField1 = "foo", put the IndexedField2 comparison first in the WHERE clause:

select count(*), sum(NonIndexedField) from myTable where IndexedField2 = "bar" and IndexedField1 = "foo";

QLDB may try to use both indexes, but the left-most one is the most important.
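The "most selective predicate first" rule can be sketched as a tiny helper that orders equality predicates by how many documents you expect each to match. This is purely illustrative plain Python, not a QLDB API; the match-count estimates would have to come from your own knowledge of the data, since QLDB does not expose index statistics:

```python
# Hypothetical sketch: order equality predicates so the most selective
# one (fewest estimated matching documents) comes first in the WHERE clause.

def order_predicates(predicates):
    """predicates: list of (clause, estimated_matches) tuples."""
    ranked = sorted(predicates, key=lambda p: p[1])
    return " and ".join(clause for clause, _ in ranked)

where = order_predicates([
    ('IndexedField1 = "foo"', 50_000),  # many matches -> less selective
    ('IndexedField2 = "bar"', 200),     # few matches  -> more selective
])
query = f"select count(*), sum(NonIndexedField) from myTable where {where}"
```

Here the IndexedField2 comparison ends up left-most because it is expected to match far fewer documents.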

Besides performance, the other problem you might run into when there are many documents to scan through after the index hit is OCC exceptions. If another process inserts, updates, or deletes a document with an IndexedField1 value of "foo" or an IndexedField2 value of "bar", your aggregation query will get an OCC exception. This is generally not a problem because the driver will simply retry, but it can become a perceived performance problem if collisions are frequent, and collision frequency increases with the number of documents in your transaction scope. We recommend keeping your transaction scope as small as you can.
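The driver-side retry behavior described above can be illustrated with a generic sketch. This is plain Python mimicking how a QLDB driver retries a conflicted transaction with backoff; the exception name, retry limit, and delay values here are made up, not the real driver API:

```python
import random
import time

class OccConflictError(Exception):
    """Stand-in for the driver's OCC conflict exception (hypothetical name)."""

def run_with_retries(txn_fn, retry_limit=4, base_delay=0.01):
    """Retry a transaction function on OCC conflicts, roughly the way
    a QLDB driver retries internally before surfacing the error."""
    for attempt in range(retry_limit + 1):
        try:
            return txn_fn()
        except OccConflictError:
            if attempt == retry_limit:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Simulate a transaction that conflicts twice and then commits.
attempts = {"n": 0}
def flaky_txn():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OccConflictError()
    return "committed"

result = run_with_retries(flaky_txn)
```

The wider the transaction scope, the more often `txn_fn` would conflict and loop, which is why the retries show up as perceived latency.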

Generally speaking, QLDB is optimized for writes and targeted fetches of small amounts of data using an equality comparison against an indexed value. It's great for validation reads that inform a transaction decision (does my account have enough balance, does this record already exist, give me customer record identified by unique ID 'X', etc.). As the number of documents in your transaction scope widens, the potential for OCCs and transaction timeouts increases. We generally recommend you stream data from QLDB into a secondary database to support OLAP, reporting, ad-hoc queries, etc.

So the real answer is that your query looks pretty good, but you should consider what your data set might look like in production now, next year, etc. and test for that to make sure you're good to go.

DanB

In my personal experience using QLDB for almost a year now, you can never rely on it to perform any kind of non-indexed selection directly on the database.

Even when a field is indexed, if the index has low cardinality (meaning the indexed values repeat a lot, as @DanB said), it will perform poorly.

The best approach is either to select the matching records into the application (if viable) and perform the sum/count there, or to use the journal export to work over all records of the database.
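The first option, pulling the matching documents into the application and aggregating there, might look like this. The rows are simulated; in a real app they would be streamed from a driver call running an indexed equality query:

```python
# Sketch: aggregate client-side instead of asking QLDB to SUM().
# Simulated result set; in practice these rows would come back from
# the driver for a query like:
#   select NonIndexedField from myTable
#   where IndexedField2 = 'bar' and IndexedField1 = 'foo'
rows = [
    {"IndexedField1": "foo", "IndexedField2": "bar", "NonIndexedField": 10},
    {"IndexedField1": "foo", "IndexedField2": "bar", "NonIndexedField": 25},
    {"IndexedField1": "foo", "IndexedField2": "bar", "NonIndexedField": 7},
]

count = len(rows)
total = sum(r["NonIndexedField"] for r in rows)
```

This keeps the transaction small (a plain indexed fetch) and moves the arithmetic out of the database entirely.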

This second option can be tricky, though, because you must rely on the inserts and updates that happened within a timeframe to get the information you want. That means you may need to export a HUGE set of files if you need to go over every available record, and then process them. As you can see, this is decidedly a non-live operation.
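Processing an export boils down to collapsing every insert/update revision into the latest state of each document before aggregating. A simplified sketch of that reduction; the dictionaries here mimic QLDB's metadata/data split, but the real export layout is more involved than this:

```python
# Sketch: collapse exported revisions to each document's latest version,
# then aggregate over the surviving states. Illustrative layout only.
revisions = [
    {"metadata": {"id": "doc1", "version": 0}, "data": {"NonIndexedField": 5}},
    {"metadata": {"id": "doc1", "version": 1}, "data": {"NonIndexedField": 8}},
    {"metadata": {"id": "doc2", "version": 0}, "data": {"NonIndexedField": 3}},
]

latest = {}
for rev in revisions:
    doc_id = rev["metadata"]["id"]
    # Keep only the highest version seen for each document id.
    if doc_id not in latest or rev["metadata"]["version"] > latest[doc_id]["metadata"]["version"]:
        latest[doc_id] = rev

total = sum(rev["data"]["NonIndexedField"] for rev in latest.values())
```

Note that doc1's earlier value (5) is discarded in favor of its later revision (8), which is exactly the bookkeeping that makes export-based aggregation a batch job rather than a live query.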

In short, never think of QLDB as a relational or NoSQL database and things will work out better. It's a ledger. Period.

Vic Wolk