Find which SQL queries from a list of predefined queries are impacted by an INSERT, UPDATE, or DELETE

Question

Introduction

I am building a caching system where each node of the cache may call any number of SQL queries from a predefined, finite, set of SQL queries with 0-n parameters.

Based on the result of those queries, the node does a rather slow computation and returns a value that is cached.

Queries might look like:

Query #1:

SELECT name 
FROM users 
WHERE id = ?;

Query #2:

SELECT email 
FROM emails 
WHERE deleted_at IS NULL AND user_id = ?;

Other queries may use joins, have none or multiple parameters, but the number of queries is finite.

I keep track of the queries and parameters set that each node calls, and build a dependencies list. Then when a query result changes, I know I need to invalidate all cache nodes that depend on it and recompute their values.

Heart of the problem

The hard part is now knowing which queries and parameters set are impacted when I do an INSERT, UPDATE, or DELETE.

Examples

INSERT INTO users ("id", "name") 
VALUES ('foo', 'John');

This operation will impact query #1 with parameters ['foo'], and all cache nodes that depend on that query with exactly those parameters should be invalidated.

UPDATE users 
SET birth_date = '1990-01-01' 
WHERE id = 'foo';

This operation will not impact query #1 because it does not rely on column birth_date to build its result.

DELETE FROM users 
WHERE id = 'bar';

This will impact query #1 with parameters ['bar'] even tho after the operation no rows matches query #1.

First solution

The solution I came up with works but sure needs improvements.

For each operation on the database keep track of a set of rows and columns that were impacted:
INSERT: consider the inserted row, with all of its columns
UPDATE: consider the row before, and after it was updated, with only the updated columns. You end up with 2 rows
DELETE: consider the deleted row before it was deleted, with all of its columns
For each row found in step 1, find all queries that might be impacted. This is where I am doing a lot of manual work today. I am currently listing all dependencies of each query by hand. Example for Q1:

const dependencies = [
  {
    table: 'users',
    columns: ['id', 'name'],
    getParams: (row) => [[row.id]], 
  }
]

A few interesting things to notice:

A query may depend on multiple tables when using joins, so dependencies is an array
I list the columns the query depends on, so updates on other columns can be skipped
We know a row impacts a query by looking at the table and the columns
We need to find the set of parameters based on the row.
The result is an array because a row might impact the same query with multiple parameters set. In this basic example, the length of the array is only 1 because the row impacts the query with 1 parameter set.

Now consider the following query:

UPDATE users 
SET id = 'bar' 
WHERE id = 'foo';

Based on step 1, we build two rows:

{ id: 'foo' }: value of the row before the update
{ id: 'bar' }: value of the row after the update

Notice that both rows only have the id column, because we only updated this column. Now looking at the dependencies array we built above, we know that both rows impact query Q1 because the table matches, and the columns overlap (they both have the id column).

To find the parameters set I need to call getParams for each row and flatten the results: [['foo'], ['bar']].

And that is it. We now invalidate all cache nodes that depend on Q1 with parameters set ['foo'] or ['bar'].

Open question

I am looking for any other route I might have overlooked. And most importantly I am looking for a way to build the dependencies of each query automatically, doing it by hand is slow, difficult, and error-prone.

As many RDBMS have really efficient internal query caching systems, may i ask you why do you look for a supplementary cache layer ? Is it for some research ? Are you building your own RDBMS/DBMS ? — Zilog80, May 20 '21 at 15:02
Checking your techblog gives me a partial answer. You're looking for a cache on GraphQL ? — Zilog80, May 20 '21 at 15:12
Each cache node actually does 0-n queries, and then computes a value based on the results of those queries. What I am caching is the result of this (rather slow) computation, not the result of any particular query. I am trying to trigger a cache invalidation when the value of the node should be computed again because one or more of the queries that the computation relies on might have been updated by an INSERT, UPDATE, or DELETE. I hope that makes more sense. — Nicolas Keller, May 20 '21 at 15:48
It makes sense. Your goal is to selectively invalidate cache entries based on data relations, then you have to look for overlapping ranges between DML statements (`INSERT`, `UPDATE`, etc.) and the queries associated with your cache entries. Thus you look for an automated way to build a dependency tree for each of the queries. Did i resume well ? — Zilog80, May 20 '21 at 17:01

Zilog80 · Answer 1 · 2021-05-21T15:58:15.953

On the other possible approach, i suggest you to check if you can work directly with your RDBMS for that, if your RDBMS have result caching ability. Some RDBMS can be interrogated about the result cache status of a SQL query, thus giving you directly a way to know if your cache entries are still valid or not without having to parse DML statements. Query object dependencies are also provided for at least one RDBMS, that could be useful for your automated dependencies build.

Pro:

It can be done query by query or for a bunch of queries, the RDBMS does the job for you.
The RDBMS can handle far more complex situations.
Scalable.
Reliable. The RDBMS are very rarely wrong.
You just have get the result cache id of your queries while executing them with the result cache status option.

Cons:

A big one. You need to submit at least a cache interrogation query to the RDBMS for your queries. That's implies network I/O and latency.
You need to configure/tune your RDBMS to use a result cache (generally, RDBMS have it enabled by default when they have this ability).
On an heavy loaded RDBMS, some query results wont be cached from time to time, this implies that the related cache entries will be invalidated thus adding load on the RDBMS...
Beware of restrictions about result caching, query with timestamp or sequence references will be generally excluded from result cache.
Result cache invalidating policy may not meets your needs (cache expiration, not fine grained invalidations, etc.)

For the first cons, a good way to handle it would be check the last DML execution on your RDBMS (through audit tables for example) before checking cache result for a bunch of queries. There's still a loss (I/O latency), but a least it will minimize the load for the RDBMS and your cache layer. (Not reliable, this can induce race conditions. If you have a query at T, get "ok no DML" for T(delta1) from audit at T(delta2), you'll submit your cache entry for T(delta2) while a DML could occur between T(delta1) and T(delta2).)

To illustrate the point, you can submit an EXPLAIN PLAN statement on Oracle RDBMS 11+ with the /*+ RESULT_CACHE */ hints to get a result cache id for your query. You can then later just query the v$result_cache_objects for this cache id with TYPE as Result and check the STATUS column. If it is different from Published (Invalidate, Sync, etc.) or if the cache id has changed, you can invalidate your cache entry. It means also that you need to get and store the query cache id of your query in your cache entry when you populate it or refresh it. And you should get the result cache id of your query immediately after executing your query with /*+ RESULT_CACHE */ and get it from the query DBMS bloc, so you have to use a SQL client/interface able to return theses data.
Documentation here..

With SQLServer, AFAIK, to this day, there's no result cache ability, SQLServer apply caching to its output buffer thus it can't permit that kind of usages.

Thanks for the detailed explaination @Zilog80 I am using Postgres and I believe that only Oracle has surch result caching capability. Maybe I can try the route of parsing the EXPLAIN result — Nicolas Keller, May 21 '21 at 09:41
@NicolasKeller I guess a more efficient way would be to use the postgresql metaschema through `pg_constraint.consrc`. — Zilog80, Jun 02 '21 at 15:08

Find which SQL queries from a list of predefined queries are impacted by an INSERT, UPDATE, or DELETE

Introduction

Heart of the problem

First solution

Open question

1 Answers1