
In Redshift, there's an STL_QUERY table that stores queries that were run over the last 5 days. I'm trying to find a way to keep more than 5 days worth of records. Here are some things that I've considered:

  1. Is there a Redshift setting for this? It would appear not.
  2. Could I use a trigger? Triggers are not available in Redshift, so this is a no-go.
  3. Could I create an Amazon Data Pipeline job to periodically "scrape" the STL_QUERY table? I could, so this is an option. Unfortunately, I would have to give the pipeline some EC2 instance to use to run this work. It seems like a waste to have an instance sitting around to scrape this table once a day.
  4. Could I use an Amazon Simple Work Flow job to scrape the table? I could, but it suffers from the same issues as 3.

Are there any other options/ideas that I'm missing? I would prefer an option that does not involve me dedicating an EC2 instance, even if it means paying for an additional service (provided that it's cheaper than the EC2 instance I would have used in its stead).

bstempi

1 Answer


Keep it simple, do it all in Redshift.

First, use `CREATE TABLE … AS` to save all of the current history into a permanent table:

CREATE TABLE admin.query_history AS SELECT * FROM stl_query;

Second, schedule a job on a machine you control to run the following insert every day, using psql to execute it:

INSERT INTO admin.query_history SELECT * FROM stl_query WHERE query > (SELECT MAX(query) FROM admin.query_history);
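The daily job above can be wired up as an ordinary cron entry. This is only a sketch: the cluster endpoint, database, and user below are placeholders you'd replace with your own, and it assumes psql is installed and the password is supplied via `~/.pgpass`.

```shell
# Run the incremental load every day at 02:00.
# Hypothetical connection details -- substitute your cluster's endpoint,
# port, database, and user. Credentials come from ~/.pgpass.
0 2 * * * psql -h mycluster.abc123.us-east-1.redshift.amazonaws.com -p 5439 -d mydb -U admin -c "INSERT INTO admin.query_history SELECT * FROM stl_query WHERE query > (SELECT MAX(query) FROM admin.query_history);"
```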

Done. :)

Notes:

  • Redshift is based on PostgreSQL 8.0.2, so you need an 8.x-compatible version of psql if you haven't set this up yet.
  • Even if your job doesn't run for a few days, stl_query keeps enough history that you'll be covered.
  • As per your comment, it might be safer to use starttime instead of query as the criteria.
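Per that last note, a sketch of the same incremental load keyed on `starttime` instead of `query` (assuming the same `admin.query_history` table created above):

```sql
INSERT INTO admin.query_history
SELECT *
FROM stl_query
WHERE starttime > (SELECT MAX(starttime) FROM admin.query_history);
```

This avoids relying on `query` being monotonically increasing, at the cost of possibly re-inserting or skipping rows that share the exact maximum `starttime` across runs.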
Joe Harris
  • Thanks for the answer! Two things, one: is the `query` field guaranteed to increment? Should I be using `starttime` instead? Two: I don't like the idea of having it run on my laptop. Where to run this thing was the focus of the question. I'll let it go for one more day, but if no one else answers and you remove the "laptop" bit, I'll accept your answer. – bstempi Nov 08 '13 at 16:10
  • OK, I removed the reference to laptop. I suspect that you've got at least 1 machine in your control that runs 24/7. A cron job on that will do the trick. You could look at something like AWS Data Pipeline to run the script on a schedule but it seems like overkill for this use case. – Joe Harris Nov 08 '13 at 20:27