I've got a remedial question regarding sequences. I've used them a bit and consulted the docs, and am hoping that this is an easy question for the group. We're on Postgres 11.4 now, and will move to PG 12 whenever it's available on RDS.
The goal is to have a set of numbers that increase every time a row is inserted or updated. We use the field name "con_id" (concurrency ID) for this kind of counter. So the first time a row is inserted in an empty table, the value is 1, the second row gets 2, etc. Sounds like a SEQUENCE. I had a standard sequence in this role, then switched to AS IDENTITY...but realize now that was probably a mistake.
On update, the counter should keep working. So, if the first row is updated, its con_id changes from 1 to 3, the current max()+1. For the record, all of our updates go through ON CONFLICT (id) DO UPDATE SET, not a straight UPDATE.
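To make the intended behavior concrete, here's a minimal sketch; the table name demo and the hand-assigned con_id values are purely illustrative, not how we'd actually generate them:
CREATE TABLE demo (
    id     integer PRIMARY KEY,
    con_id bigint NOT NULL
);
INSERT INTO demo (id, con_id) VALUES (1, 1);  -- first row inserted  -> con_id 1
INSERT INTO demo (id, con_id) VALUES (2, 2);  -- second row inserted -> con_id 2
-- Updating row 1 through our usual upsert should move it to max(con_id) + 1, i.e. 3.
INSERT INTO demo (id, con_id) VALUES (1, 3)
ON CONFLICT (id) DO UPDATE SET con_id = EXCLUDED.con_id;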
The point of the number series is to define start-stop bounds for various operations:
operation     last_number_processed
sync_to_domo  124556
rollup_day    123516
rollup_week   103456
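A rough sketch of the tracker behind that list might look like this (the name operation_tracker is made up for illustration):
CREATE TABLE operation_tracker (
    operation             text PRIMARY KEY,
    last_number_processed bigint NOT NULL
);
INSERT INTO operation_tracker (operation, last_number_processed) VALUES
    ('sync_to_domo', 124556),
    ('rollup_day',   123516),
    ('rollup_week',  103456);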
Then, when it's time to perform one of these operations, all you need to do to find the right chunk of records is select con_id from last_number_processed+1 through max(con_id). You have to update the operation tracker with that max(con_id) once the operation completes.
select max(con_id) from source_data; -- Get the current highest number.
The range is then something like 124557-128923 for "sync_to_domo".
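In rough SQL, and assuming the hypothetical operation_tracker table sketched above, a sync_to_domo pass would go something like this (numbers taken from the example):
-- 1. Capture the current high-water mark.
SELECT max(con_id) FROM source_data;          -- say it returns 128923

-- 2. Pull the chunk between the last processed number and that max.
SELECT *
  FROM source_data
 WHERE con_id > 124556                        -- last_number_processed for sync_to_domo
   AND con_id <= 128923;                      -- the max captured in step 1

-- 3. Once the operation completes, advance the tracker.
UPDATE operation_tracker
   SET last_number_processed = 128923
 WHERE operation = 'sync_to_domo';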
Uniqueness isn't required here, although it is desirable. Gaps don't matter at all. Keeping the numbers in sequence is essential.
This kind of operation on update is the sort of thing that could easily be a horrific bottleneck, if I botched it. Can anyone suggest the best reliable, low-contention strategy for maintaining a counter that gets the max value from the table +1 on each insert or update?
And, yes, a timestamptz can be used for this purpose; it's just another kind of number line. The reason for an integer is to match how we code other systems. It's just easier to explain and reason about this stuff when the data type remains the same across platforms. Or so it seems, in this case.
Test Code
I'm adding some test code and results to my original question. This works, but I doubt it's super efficient. The test below is a minimal version, and not useful...I'm just trying to determine if I can get an ever-increasing number line of insertions and revisions. This effort looks okay on a quick check, but there's so much that I do not have internalized about Postgres' approach to contention, that I'm including it mostly so that people can tell me why it's terrible. So, please do ;-)
The setup is to have a sequence that's assigned automatically on INSERT and via a per-ROW trigger on UPDATE. Sounds less than ideal; is there a better way? It works in a single-connection test, but double-increments the counters on UPDATE. That's not a problem for me, but I do not understand why it is happening.
Here's the stand-alone test code:
DROP TABLE IF EXISTS data.test_con_ids;
DROP SEQUENCE IF EXISTS data.test_con_ids_sequence;
DROP FUNCTION IF EXISTS data.test_con_ids_update;
BEGIN;
CREATE SEQUENCE data.test_con_ids_sequence
AS bigint;
CREATE TABLE data.test_con_ids (
id integer NOT NULL DEFAULT NULL PRIMARY KEY,
con_id bigint NOT NULL DEFAULT NEXTVAL('data.test_con_ids_sequence'),
dts timestamptz default NOW()
);
CREATE OR REPLACE FUNCTION data.test_con_ids_update()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
BEGIN
NEW.con_id := NEXTVAL('data.test_con_ids_sequence');
RETURN NEW;
END
$function$;
-- It's late here, not sure if I could use a FOR EACH STATEMENT trigger here, which should be faster. If it would work.
CREATE TRIGGER test_con_ids_update_trigger BEFORE UPDATE ON data.test_con_ids
FOR EACH ROW EXECUTE PROCEDURE data.test_con_ids_update();
-- Add ten records, IDs 1-10, con_ids 1-10.
INSERT INTO data.test_con_ids (id)
VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
-- Update rows 1-5 to their current values. The trigger should increment the con_id.
INSERT INTO data.test_con_ids (id)
VALUES (1),(2),(3),(4),(5)
ON CONFLICT(id) DO UPDATE set id = EXCLUDED.id; -- Completely pointless, obviously...just a test. We always UPSERT with ON CONFLICT DO UPDATE.
COMMIT;
And here are the results:
id  con_id  dts
1   12      2019-11-02 21:52:34.333926+11
2   14      2019-11-02 21:52:34.333926+11
3   16      2019-11-02 21:52:34.333926+11
4   18      2019-11-02 21:52:34.333926+11
5   20      2019-11-02 21:52:34.333926+11
6   6       2019-11-02 21:52:34.333926+11
7   7       2019-11-02 21:52:34.333926+11
8   8       2019-11-02 21:52:34.333926+11
9   9       2019-11-02 21:52:34.333926+11
10  10      2019-11-02 21:52:34.333926+11
That works: 1-10 are created, 1-5 are updated and get their con_id counter values incremented. By 2 for some reason (?), but at least they're in a useful order, which is what we need.
Can anyone offer suggestions on how to get this behavior more efficiently? The goal is an ever-increasing number line for records reflecting last INSERT and UPDATE activities. And, because we use integers for this everywhere else, we're trying to stick with integers instead of timestamps. But, honestly, that's cosmetic in a lot of ways. Another reason I'm looking at SEQUENCE is that, unless I've misunderstood, it's not bound up in the transaction. That's perfect for this...we don't need a gapless number series, just a sequential one.
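As I understand it, that non-transactional behavior looks roughly like this (a hedged illustration; the returned values are just examples):
BEGIN;
SELECT nextval('data.test_con_ids_sequence');  -- say this returns 21
ROLLBACK;
-- The rollback does not hand the value back:
SELECT nextval('data.test_con_ids_sequence');  -- returns 22, leaving a gap at 21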
Postgres 12 Test
Following on Belayer's suggestion, I created a PG 12 database as an experiment. I went with the defaults, so everything is in public. (In the real world, I strip out public.) Yes, a generated column seems to work, so long as you have an immutable function. I've read about IMMUTABLE in Postgres several times....and I don't get it. So, I can't say that this function is safe. Seems like it should be. I followed the patterns used in this worthwhile piece:
https://www.2ndquadrant.com/en/blog/generated-columns-in-postgresql-12/
CREATE OR REPLACE FUNCTION public.generate_concurrency_id() RETURNS bigint
AS $$
SELECT EXTRACT(EPOCH FROM clock_timestamp())::bigint;
$$
LANGUAGE sql IMMUTABLE;
COMMENT ON FUNCTION public.generate_concurrency_id() IS 'Generate a bigint to act as a progressive change counter for a table.';
DROP TABLE IF EXISTS public.test_con_ids;
CREATE TABLE test_con_ids (
id integer NOT NULL DEFAULT NULL PRIMARY KEY,
con_id bigint GENERATED ALWAYS AS (generate_concurrency_id()) STORED,
dts timestamptz default NOW()
);
-- Add ten records, IDs 1-10, con_ids 1-10.
INSERT INTO public.test_con_ids (id)
VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
-- Unless you wait here, the con_ids all look the same...this operation is way too short to tick over to another second, unless you get lucky.
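-- (Illustrative addition, not part of the original test.) Pause for a couple of seconds so
-- the epoch-based con_id can tick over to a later second before the upsert below.
SELECT pg_sleep(2);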
-- Update rows 1-5 to their current values. The generated column should produce a new con_id.
INSERT INTO public.test_con_ids (id)
VALUES (1),(2),(3),(4),(5)
ON CONFLICT(id) DO UPDATE set id = EXCLUDED.id; -- Completely pointless, obviously...just a test. We always UPSERT with ON CONFLICT DO UPDATE.
The example above does work with the generated column, but I don't know about the performance characteristics of PG 12 generated columns vs. triggers.
On a sadder note, I'm thinking that this might not work out for me at all. I really need a commit time stamp, which is only available through logical decoding. Here's why, through an example.
- I'm summarizing data and want to grab unprocessed updated or inserted rows.
- The last number I got is from a record added at 01:00:00 AM. Now I want to get all higher numbers.
- Okay...I do that.
- Oh wait, an incomplete transaction comes in and commits timestamps/derived numbers from earlier (see the sketch below).
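Here's a hedged sketch of that failure as a two-session timeline against source_data (the ids and con_id values are made up):
-- Session A: a long-running transaction, still open.
BEGIN;
INSERT INTO source_data (id) VALUES (1001);  -- handed con_id 124557
-- ...keeps working, not yet committed...

-- Session B: a quick transaction.
BEGIN;
INSERT INTO source_data (id) VALUES (1002);  -- handed con_id 124558
COMMIT;

-- My collector runs now: it sees max(con_id) = 124558 and records that as processed.

-- Session A finally commits. con_id 124557 only becomes visible *after* the collector
-- has moved its high-water mark past it, so that row is never picked up.
COMMIT;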
A SEQUENCE doesn't really work here either because, again, the row/tuple isn't visible to my collector process until after the transaction commits.
So I think it's logical decoding, an update summary table, or bust.