12

I'm working on a design for a concurrency-safe incremental aggregate rollup system, and track_commit_timestamp (pg_xact_commit_timestamp) sounds perfect. But I've found very little commentary on it generally, and couldn't figure out how it works in detail from the source code.

Hopefully, someone knows the answers to one or more of my questions:

  • Is it possible for the commit timestamp feature to produce times out of order? What I'm after is a way to identify records that have been changed since a specific time so that I can get any later changes for processing. If there are identical timestamps, I don't need them in perfect commit sequence.

  • How many bytes are added to each row in the final implementation? The discussions I saw ranged from 12 to 24 bytes, with talk of adding extra bytes "just in case." That discussion was pre-9.5, so a world ago.

  • Are the timestamps indexed internally? With a B-tree? I ask for capacity-planning reasons.

  • I've seen on StackOverflow and the design discussions that the timestamps are not kept indefinitely, but can't find the details on exactly how long they are stored.

  • Any rules of thumb on the performance impact of enabling track_commit_timestamp? I don't need the data on all tables but, where I do, it sounds like it might work perfectly.

  • Any gotchas? I tried running VACUUM FULL on a test table and none of the pg_xact_commit_timestamp values changed. It seems like a physical operation like VACUUM shouldn't change anything, but there could easily be something I've not thought of. And, honestly, my quick VACUUM test might not even mean anything.

Many thanks for any assistance!


I've edited my question to clarify what I'm trying to accomplish: I'm looking to track processed and unprocessed data based on update stamps.

select max(pg_xact_commit_timestamp(xmin)) from scan; --  2019-07-07 20:46:14.694288+10

update scan set quantity = 5 where quantity = 1; --       Change some data.

select max(pg_xact_commit_timestamp(xmin)) from scan; --  2019-07-10 09:38:17.920294+10

-- Find the changed row(s):
select * 
  from scan 
 where pg_xact_commit_timestamp(xmin) > '2019-07-07 20:46:14.694288+10'; 

The idea is to do a rollup on rows incrementally and regularly. So,

-- Track the last rolled-up timestamp.
-- Wait for 5 minutes (or whatever).
-- Find the current max commit timestamp.
-- Search for rows where the commit timestamp is between the last processed timestamp and the max.
-- Roll them up.
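A minimal SQL sketch of that loop (the rollup_status bookkeeping table and the 'error_name_counts' rollup name are illustrative placeholders, not anything that exists yet):

-- Assumes a bookkeeping table like rollup_status(rollup_name, last_processed_dts),
-- one row per rollup, holding the last commit timestamp processed.

-- One pass: everything committed after the stored watermark, up to the current max.
with bounds as (
    select last_processed_dts                                     as low_mark,
           (select max(pg_xact_commit_timestamp(xmin)) from scan) as high_mark
      from rollup_status
     where rollup_name = 'error_name_counts')
select s.*
  from scan s, bounds b
 where pg_xact_commit_timestamp(s.xmin) >  b.low_mark
   and pg_xact_commit_timestamp(s.xmin) <= b.high_mark;

-- ...roll those rows up, then set last_processed_dts to high_mark.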

Transaction IDs alone can't work because they can commit out of order very easily. And this timestamp system doesn't have to be 100% perfect, but I'm aiming for something very close to perfect. So, a bit of clock wiggle and even a bit of confusion around overlapping start/end times is likely tolerable.

Is there a glaring flaw in this plan?

Morris de Oryx

3 Answers

18

As this subject doesn't seem to show up in the archives very much, I want to add a bit of detail before moving on. I asked related questions on several lists, forums, and by direct communication. Several people were kind enough to review the source code, provide historical background, and clear this up for me. Hopefully, leaving some detail here will help someone else down the track. Errors are all mine, obviously; corrections and enhancements are more than welcome.

  • Commit timestamps are assigned when the transaction's work is completed, but that's not the same as when it is committed. The WAL writer doesn't update the stamps to keep them in chronological sequence.

  • Therefore, commit timestamps are definitely not a reliable mechanism for finding changed rows in order.

  • Multiple clocks. Self-adjusting clocks. Oh the humanity!

  • If you do want an in-order change sequence, logical decoding or replication are options. (I experimented with logical replication a couple of weeks ago. Coolest. Thing. Ever.)

  • The cost of timestamp tracking is about 10 bytes per transaction, not per row. So, not so bad. (The stored entry is an 8-byte timestamp plus a 2-byte replication origin ID; the transaction ID itself isn't stored, it just determines where the entry lives.)

  • This is all part of the existing transaction system, so the realities of transaction ID wraparound apply here too. (Not scary in my case.) See:

    https://www.postgresql.org/docs/current/routine-vacuuming.html

  • For the record, you can enable this option on RDS via a parameter group setting. Just set track_commit_timestamp to 1 and restart. (The equivalent setting in postgresql.conf is 'on'.)
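For a self-managed server, a minimal sketch of turning it on (track_commit_timestamp only takes effect at server start, so a restart is needed):

alter system set track_commit_timestamp = on;
-- ...restart the server, then confirm:
show track_commit_timestamp;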

Morris de Oryx
  • “Commit timestamps are assigned when the transaction's work is completed, but that's not the same as when it is committed. The WAL writer doesn't update the stamps to keep them in chronological sequence.” Does this mean you can’t rely on this to even determine the order transactions occurred in (even if you don’t care about the exact time)? – Jordan Jul 29 '20 at 21:59
6

Lots of questions.

For a reference, the source code is in src/backend/access/transam/commit_ts.c.

  1. I am not sure if it can be guaranteed that a later commit log sequence number implies a later timestamp. I would certainly not rely totally on it if the system clock can jump backwards due to time adjustments.

  2. The timestamp is not stored in the row at all, but in the pg_commit_ts subdirectory of the data directory. Each record takes 10 bytes:

    /*
     * We need 8+2 bytes per xact.  Note that enlarging this struct might mean
     * the largest possible file name is more than 5 chars long; see
     * SlruScanDirectory.
     */
    typedef struct CommitTimestampEntry
    {
        TimestampTz time;
        RepOriginId nodeid;
    } CommitTimestampEntry;
    

    There is also information about commit timestamps in the transaction log so it can be recovered.

  3. No index is needed, because the location of the timestamp is determined by the transaction number (each transaction has a fixed location for the commit timestamp). See TransactionIdToCTsPage.

  4. Timestamps are kept as long as transaction numbers, if I understand the code correctly.

  5. I can't tell what the overhead is, but it probably isn't huge.

  6. Why should VACUUM or VACUUM (FULL) change the commit timestamp? That would be a bug.
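For anyone who wants to poke at this from SQL, a quick sketch (assuming track_commit_timestamp is on and using the scan table from the question):

-- Commit timestamp of the transaction that last wrote each row:
select xmin, pg_xact_commit_timestamp(xmin) as committed_at
  from scan
 limit 10;

-- The most recent committed transaction the server knows about (its xid and commit timestamp):
select * from pg_last_committed_xact();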

Now that I understand what you want to achieve with commit timestamps, a word about that (I wish people would state the real question right away):

Commit timestamps are not the right tool for you. You cannot index the expression, because pg_xact_commit_timestamp is not immutable.

Choose the simple and obvious solution and add an extra timestamp with time zone column, maintained by a BEFORE trigger that sets it to current_timestamp on INSERT and UPDATE. That column can be indexed.
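A minimal sketch of what that could look like (the column, function, and trigger names are illustrative):

alter table scan add column row_modified_dts timestamptz;

create function set_row_modified_dts() returns trigger
language plpgsql as $$
begin
    new.row_modified_dts := current_timestamp;  -- transaction start time, same for every row written by the transaction
    return new;
end;
$$;

create trigger scan_set_row_modified_dts
    before insert or update on scan
    for each row
    execute function set_row_modified_dts();    -- "execute procedure" on PostgreSQL 10 and older

-- Unlike pg_xact_commit_timestamp(xmin), this column can be indexed:
create index scan_row_modified_dts_idx on scan (row_modified_dts);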

A famous man has said that premature optimization is the root of all evil.

Laurenz Albe
  • Thank you for the quick, detailed answer. You rule! Thanks also for the pointer to the right source file. It's helpful, as far as I can understand it. I've updated my question to explain what I'm trying to accomplish. > Timestamps are kept as long as transaction numbers, if I understand the code correctly. Okay, if it's like a roll-around limit, that makes them more than good enough for our purposes. – Morris de Oryx Jul 09 '19 at 23:55
  • > I can't tell what the overhead is, but it probably isn't huge. I've noticed that a lot of stats collection and replication services require the commit timestamp option to be on, and haven't found anyone complaining about it. > Why should VACUUM or VACUUM (FULL) change the commit timestamp? That would be a bug. As far as I understand it, it *shouldn't* change anything as a physical record movement isn't a transaction in the database sense. But, I'm new enough to Postgres to not want to assume things, and I *definitely* do not know what I do not know. – Morris de Oryx Jul 09 '19 at 23:56
  • I have extended my answer. – Laurenz Albe Jul 10 '19 at 03:50
1

Laurenz, first off, you're a champion for digging in and helping me out. Thank you. For background, I've asked this question in more detail on a few of the PG mailing lists, and got zero responses. I think it was because my complete question was too long.

I tried to be shorter here and, sadly, have not explained the important part clearly. Physical optimization is not the driving concern. In fact, the commit_timestamp system will cost me space as it's a global setting for all tables. My real tables will have full timestamptz (set to UTC) fields that I'll index and aggregate against. What I'm trying to sort out now (design phase) is the accuracy of the approach. Namely, am I capturing all events once and only once?

What I need is a reliable sequential number or timeline to mark the highest/latest row I processed and the current highest/latest row. This lets me grab any rows that have not been processed without re-selecting already handled rows, or blocking the table as it adds new rows. This idea is called a "concurrency ID" in some contexts. Here's a sketch adapted from another part of our project where it made sense to use numbers instead of timestamps (but timelines are a type of number line):

D'oh! I can't post images. It's here:

https://i.stack.imgur.com/JPKvO.jpg

It shows a number line for tracking records, split into three portions: [Done] [Capture these] [Tailing]

"Done" is everything from the highest/latest counter processed.

"Capture these" is everything later than "Done" and less than the current max counter in the table.

"Tailing" is any new, higher counters added by other inputs while the "capture these" rows are being processed.

It's easier to see in a picture.

So, I've got a small utility table such as this:

CREATE TABLE "rollup_status" (
    "id" uuid NOT NULL DEFAULT extensions.gen_random_uuid(), -- We use UUIDs, not necessary here, but it's what we use. 
    "rollup_name" text NOT NULL DEFAULT false,               
    "last_processed_dts" timestamptz NOT NULL DEFAULT NULL); -- Marks the last timestamp processed.

And now imagine one entry:

rollup_name         last_processed_dts
error_name_counts   2018-09-26 02:23:00

So, my number line (timeline, in the case of the commit timestamps) is processed from whatever the 0 date is through 2018-09-26 02:23:00. The next time through, I get the current max from the table I'm interested in, 'scan':

select max(pg_xact_commit_timestamp(xmin)) from scan; -- Pretend that it's 2019-07-08 01:00:00.000000+10

This value becomes the upper bound of my search, and the new value of rollup_status.last_processed_dts.

-- Find the changed row(s):
select * 
  from scan 
 where pg_xact_commit_timestamp(xmin) >  '2019-07-07 20:46:14.694288+10' and
       pg_xact_commit_timestamp(xmin) <= '2019-07-08 01:00:00.000000+10';

That's the "capture these" segment of my number line. This is also the only use I've got planned for the commit timestamp data. We're pushing data in from various sources, and want their timestamps (adjusted to UTC), not a server timestamp. (Server timestamps can make sense, they just don't happen to in the case of our data.) So, the sole purpose of the commit timestamp is to create a reliable number line.

If you look at the chart, it shows three different number lines for the same base table. The table itself only has one number or timeline, there are three different uses of that number/time series. So, three rollup_status rows, going with my sketch table from earlier. The "scan" table needs to know nothing about how it is used. This is a huge benefit of this strategy. You can add, remove, and redo operations without having to alter the master table or its rows at all.

I'm also considering an AFTER INSERT/UPDATE statement-level trigger with a transition table for populating a timestamptz (set to UTC), like row_committed_dts. That might be my plan B, but it requires adding the triggers, and it seems like it could only be a bit less accurate than the actual transaction commit time. Probably a small difference, but with concurrency stuff, little problems can blow up into big bugs in a hurry.

So, the question is if I can count on the commit timestamp system to produce accurate results that won't appear "in the past." That's why I can't use transaction IDs. They're assigned at the start of the transaction, but can be committed in any order. (As I understand it.) Therefore, my range boundaries of "last processed" and "current maximum in the table" can't work. I could get that range and a pending transaction could commit with thousands of records with a timestamp earlier than my previously recorded "max value." That's why I'm after commit timestamps.

Again, thanks for any help or suggestions. I'm very grateful.

P.S. The only discussion I've run into in the Postgres world of something like this is here:

Scalable incremental data aggregation on Postgres and Citus https://www.citusdata.com/blog/2018/06/14/scalable-incremental-data-aggregation/

They're using bigserial counters in this way but, as far as I understand it, that only works for INSERT, not UPDATE. And, honestly, I don't know enough about Postgres transactions and serials to think through the concurrency behavior.

Morris de Oryx