Are Redshift system tables immutable and well ordered?

Question

Redshift system tables only story a few days of logging data - periodically backing up rows from these tables is a common practice to collect and maintain proper history. To find new rows added in to system logs I need to check against my backup tables either on query (number) or execution time.

According to an answer on How do I keep more than 5 day's worth of query logs? we can simply select all rows with query > (select max(query) from log). The answer is unreferenced and assumes that query is inserted sequentially.

My question in two parts - hoping for references or code-as-proof - is

are query (identifiers) expected to be inserted sequentially, and
are system tables, e.g. stl_query, immutable or unchanging?

Assuming that we can't verify or prove both the above, then what's the right strategy to backup the system tables?

I am wary of this because I fully expect long running queries to complete after many other queries have started and completed.

I know query (identifier) is generated at query submit time, because I can monitor in progress queries. Therefore it is completed expected that a long running query=1 may complete after query=2. If the stl_query table is immutable then query=1 will be inserted after query=2, and the max(query) logic is flawed.

Alternatively, if query=1 is inserted into stl_query at run time, then the row must be updated upon completion (with end time, duration, etc). This would required me to do an upsert into the backup table.

michael_erasmus · Accepted Answer · 2018-01-17T19:57:27.327

I think the stl_query table is indeed immutable, it would seem that it's only written to after a query finishes.

Here is why I think that. First off, I ran this query on a cluster with running queries

select count(*) from stl_query where endtime is null

This returns 0. My hunch is that you'll probably see the same thing on your side.

To be double sure, I also ran this query:

select count(*) from stv_inflight i
inner join stl_query q on q.query = i.query

This also returns zero (while I did have queries inflight), which seems to confirm that queries are only logged in stl_query when they have finished executing and are not updated.

That said, I would rewrite the query to insert into your history table as such:

insert into admin.query_history (
    select * from stl_query
    where query not in (select query from admin.query_history)
)

That way, you'll always insert any records you don't have in the history table.

Thanks Michael. I'm wary of the `not in` because these tables will eventually get big, but I guess that's over-optimizing. But for now I'm using `where endtime > max(endtime)` for those tables that have an `endtime` column. — Kirk Broadhurst, Jan 17 '18 at 19:57
If you are worried about performance, I would suggest creating a distkey on query and a sortkey on endtime. That way I suspect if you filter on endtime > max(endtime) and query not in(...), the query would still be pretty performant. — michael_erasmus, Jan 17 '18 at 20:03

Are Redshift system tables immutable and well ordered?

1 Answers1