2

Redshift system tables only story a few days of logging data - periodically backing up rows from these tables is a common practice to collect and maintain proper history. To find new rows added in to system logs I need to check against my backup tables either on query (number) or execution time.

According to an answer on How do I keep more than 5 day's worth of query logs? we can simply select all rows with query > (select max(query) from log). The answer is unreferenced and assumes that query is inserted sequentially.

My question in two parts - hoping for references or code-as-proof - is

  1. are query (identifiers) expected to be inserted sequentially, and
  2. are system tables, e.g. stl_query, immutable or unchanging?

Assuming that we can't verify or prove both the above, then what's the right strategy to backup the system tables?


I am wary of this because I fully expect long running queries to complete after many other queries have started and completed.

I know query (identifier) is generated at query submit time, because I can monitor in progress queries. Therefore it is completed expected that a long running query=1 may complete after query=2. If the stl_query table is immutable then query=1 will be inserted after query=2, and the max(query) logic is flawed.

Alternatively, if query=1 is inserted into stl_query at run time, then the row must be updated upon completion (with end time, duration, etc). This would required me to do an upsert into the backup table.

Kirk Broadhurst
  • 27,836
  • 16
  • 104
  • 169

1 Answers1

2

I think the stl_query table is indeed immutable, it would seem that it's only written to after a query finishes.

Here is why I think that. First off, I ran this query on a cluster with running queries

select count(*) from stl_query where endtime is null

This returns 0. My hunch is that you'll probably see the same thing on your side.

To be double sure, I also ran this query:

select count(*) from stv_inflight i
inner join stl_query q on q.query = i.query

This also returns zero (while I did have queries inflight), which seems to confirm that queries are only logged in stl_query when they have finished executing and are not updated.

That said, I would rewrite the query to insert into your history table as such:

insert into admin.query_history (
    select * from stl_query
    where query not in (select query from admin.query_history)
)

That way, you'll always insert any records you don't have in the history table.

michael_erasmus
  • 906
  • 1
  • 9
  • 17
  • Thanks Michael. I'm wary of the `not in` because these tables will eventually get big, but I guess that's over-optimizing. But for now I'm using `where endtime > max(endtime)` for those tables that have an `endtime` column. – Kirk Broadhurst Jan 17 '18 at 19:57
  • If you are worried about performance, I would suggest creating a distkey on query and a sortkey on endtime. That way I suspect if you filter on endtime > max(endtime) and query not in(...), the query would still be pretty performant. – michael_erasmus Jan 17 '18 at 20:03