Introduction
I am building a caching system where each node of the cache may call any number of SQL queries from a predefined, finite, set of SQL queries with 0-n parameters.
Based on the result of those queries, the node does a rather slow computation and returns a value that is cached.
Queries might look like:
Query #1:
SELECT name
FROM users
WHERE id = ?;
Query #2:
SELECT email
FROM emails
WHERE deleted_at IS NULL AND user_id = ?;
Other queries may use joins, have none or multiple parameters, but the number of queries is finite.
I keep track of the queries and parameters set that each node calls, and build a dependencies list. Then when a query result changes, I know I need to invalidate all cache nodes that depend on it and recompute their values.
Heart of the problem
The hard part is now knowing which queries and parameters set are impacted when I do an INSERT, UPDATE, or DELETE.
Examples
INSERT INTO users ("id", "name")
VALUES ('foo', 'John');
This operation will impact query #1 with parameters ['foo']
, and all cache nodes that depend on that query with exactly those parameters should be invalidated.
UPDATE users
SET birth_date = '1990-01-01'
WHERE id = 'foo';
This operation will not impact query #1 because it does not rely on column birth_date
to build its result.
DELETE FROM users
WHERE id = 'bar';
This will impact query #1 with parameters ['bar']
even tho after the operation no rows matches query #1.
First solution
The solution I came up with works but sure needs improvements.
- For each operation on the database keep track of a set of rows and columns that were impacted:
INSERT
: consider the inserted row, with all of its columns
UPDATE
: consider the row before, and after it was updated, with only the updated columns. You end up with 2 rows
DELETE
: consider the deleted row before it was deleted, with all of its columns - For each row found in step 1, find all queries that might be impacted. This is where I am doing a lot of manual work today. I am currently listing all dependencies of each query by hand. Example for
Q1
:
const dependencies = [
{
table: 'users',
columns: ['id', 'name'],
getParams: (row) => [[row.id]],
}
]
A few interesting things to notice:
- A query may depend on multiple tables when using joins, so dependencies is an array
- I list the columns the query depends on, so updates on other columns can be skipped
- We know a row impacts a query by looking at the table and the columns
- We need to find the set of parameters based on the row.
The result is an array because a row might impact the same query with multiple parameters set. In this basic example, the length of the array is only 1 because the row impacts the query with 1 parameter set.
Now consider the following query:
UPDATE users
SET id = 'bar'
WHERE id = 'foo';
Based on step 1, we build two rows:
{ id: 'foo' }
: value of the row before the update{ id: 'bar' }
: value of the row after the update
Notice that both rows only have the id
column, because we only updated this column.
Now looking at the dependencies array we built above, we know that both rows impact query Q1
because the table matches, and the columns overlap (they both have the id
column).
To find the parameters set I need to call getParams
for each row and flatten the results:
[['foo'], ['bar']]
.
And that is it. We now invalidate all cache nodes that depend on Q1
with parameters set ['foo']
or ['bar']
.
Open question
I am looking for any other route I might have overlooked. And most importantly I am looking for a way to build the dependencies of each query automatically, doing it by hand is slow, difficult, and error-prone.