1

Background Info

On my Postgres server I have a few tables that are constantly accessed by business intelligence applications so ideally they should stay available most of the time. The tables get flushed and reloaded by our ETL pipeline on a nightly basis (I know... due to some legacy setup I can't use incremental updates here). The loading process takes pretty long and is not bullet-proof stable yet.

To increase availability of the tables, I decided to use staging tables to load the downstream data from our ETL pipeline first, and if they get loaded successfully, copy the data over to actual production tables.

Here is a copy function I created for this purpose:

CREATE OR REPLACE FUNCTION guarded_copy(src text, dest text) RETURNS void AS
    $$
      DECLARE
        c1 INTEGER;
        c2 INTEGER;
      BEGIN
        EXECUTE FORMAT('SELECT COUNT(*) FROM %I', src) INTO c1;
        EXECUTE FORMAT('SELECT COUNT(*) FROM %I', dest) INTO c2;
        IF c1>=c2 THEN
          EXECUTE FORMAT('TRUNCATE TABLE %I CASCADE;', dest);
          EXECUTE FORMAT('INSERT INTO %I SELECT * FROM %I;', dest, src);
        END IF;
      END
    $$
LANGUAGE plpgsql VOLATILE;

The idea is to truncate the dest table and load it with data from the src table, if the src table(staging table) has more rows than the dest table(actual production table). And this actually works.

Note that the actual production table(dest) has constraints and indexes, while the staging table(src) is configured with NO indexes or constraints to speed up the loading process from ETL.

Problem

The issue with my function above is that data copy can be very expensive because of the indexes and constraints on the dest table.

Question

  1. What's a better way to achieve the same goal?
  2. I am thinking of dropping/disabling the index on dest before the data copy step and add it back right after. How to do this in a SQL function?
  3. I am also thinking about swapping the 2 tables by renaming them, but this requires the indexes on one table being copied to the other. How can I do this inside a function?

EDIT 1

Postgres version:
PostgreSQL 9.2.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

Table constraints:
On table dest I have (unique) primary key on column id, and indexes on timestamp columns.

EDIT 2

This question and this question really helped. For option 3 above, I think the code below is close to what I want.

CREATE OR REPLACE FUNCTION guarded_swap(src text, dest text) RETURNS void AS
    $$
      DECLARE
        c1 INTEGER;
        c2 INTEGER;
        _query TEXT;
      BEGIN
        EXECUTE FORMAT('SELECT COUNT(*) FROM %I', src) INTO c1;
        EXECUTE FORMAT('SELECT COUNT(*) FROM %I', dest) INTO c2;
        IF c1>=c2 THEN

          -- create indexes in src table
          FOR _query IN
              SELECT FORMAT('%s;', REPLACE(pg_get_indexdef(ix.indexrelid), dest, src))
              FROM pg_class t, pg_class i, pg_index ix
              WHERE t.oid = ix.indrelid
                    AND i.oid = ix.indexrelid
                    AND t.relkind = 'r' and i.relkind = 'i'
                    AND t.oid= dest::regclass
              ORDER BY
                  t.relname, i.relname
          LOOP
              EXECUTE _query;
          END LOOP;

          -- drop indexes in dest table
          FOR _query IN
              SELECT FORMAT('DROP INDEX %s;', i.relname)
              FROM pg_class t, pg_class i, pg_index ix
              WHERE t.oid = ix.indrelid
                    AND i.oid = ix.indexrelid
                    AND t.relkind = 'r' and i.relkind = 'i'
                    AND t.oid= dest::regclass
              ORDER BY
                  t.relname, i.relname
          LOOP
              EXECUTE _query;
          END LOOP;


        -- create constraints in src table
          FOR _query IN
              SELECT
                  FORMAT ('ALTER TABLE %s ADD CONSTRAINT %s %s;', src,
                      REPLACE(conname, dest, src),
                      pg_get_constraintdef(oid))
              FROM pg_constraint
              WHERE contype = 'p' AND conrelid = dest::regclass
          LOOP
              EXECUTE _query;
          END LOOP;

          -- drop all constraints in dest table
          FOR _query IN
              SELECT
                  FORMAT ('ALTER TABLE %s DROP CONSTRAINT IF EXISTS %s;', dest, conname)
              FROM pg_constraint
              WHERE conrelid = dest::regclass
          LOOP
              EXECUTE _query;
          END LOOP;

          -- swap the table names
          EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', dest, CONCAT(dest, '_old'));
          EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', src, dest);
          EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', CONCAT(dest, '_old'), src);


        END IF;
      END
    $$
LANGUAGE plpgsql VOLATILE; 

EDIT 3

Another thought: PKs and FKs might be unnecessary on tables that are only used for analytics purposes. So indexes are the only concern here.

CREATE OR REPLACE FUNCTION guarded_swap(src text, dest text) RETURNS void AS
    $$
      DECLARE
        c1 INTEGER;
        c2 INTEGER;
        _idx_name TEXT;
        _query TEXT;
        _qs TEXT[];
      BEGIN
        EXECUTE FORMAT('SELECT COUNT(*) FROM %I', src) INTO c1;
        EXECUTE FORMAT('SELECT COUNT(*) FROM %I', dest) INTO c2;
        IF c1>=c2 THEN

          -- drop indexes in dest table
          FOR _idx_name, _query IN
              SELECT i.relname, FORMAT('%s;', pg_get_indexdef(ix.indexrelid))
              FROM pg_class t, pg_class i, pg_index ix
              WHERE t.oid = ix.indrelid
                    AND i.oid = ix.indexrelid
                    AND t.relkind = 'r' and i.relkind = 'i'
                    AND t.oid= dest::regclass
              ORDER BY
                  t.relname, i.relname
          LOOP
              _qs := array_append(_qs, _query);
              EXECUTE FORMAT('DROP INDEX IF EXISTS %s;', _idx_name);
          END LOOP;

          -- swap the table names
          EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', dest, CONCAT(dest, '_old'));
          EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', src, dest);
          EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', CONCAT(dest, '_old'), src);

          -- build indexes in dest table
          FOREACH _query IN ARRAY _qs
          LOOP
              EXECUTE _query;
          END LOOP;

        END IF;
      END
    $$
LANGUAGE plpgsql VOLATILE; 
Community
  • 1
  • 1
xiaolong
  • 3,396
  • 4
  • 31
  • 46
  • `table(dest) has constraints and indexes`: what constraints? And obviously you need to provide your version of Postgres. – Erwin Brandstetter Apr 15 '15 at 01:49
  • @ErwinBrandstetter, thanks for replying! Version 9.2.6. On table `dest` I have (unique) primary key on column `id`, and index on timestamp column. --Just edited the question as well. – xiaolong Apr 15 '15 at 02:48

2 Answers2

0

If you don't have other objects depending on the tables (like views and foreign keys), it's a simple operation to drop the existing table dest and just rename the new table src.
That's item 3. in your list, just without this consideration:

but this requires the indexes on one table being copied to the other.

You don't "copy" an index. Just create new identical indexes on the new table src in the background. Then the table is completely ready and the switcheroo is a matter of milliseconds. However, if you go that route with concurrent load on the table, you must be prepared to get some of these error messages in concurrent transactions:

ERROR:  could not open relation with OID 123456

Related answer with detailed explanation and sample code on dba.SE:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks for the answer Erwin! I am assuming item **3** would be fastest but haven't test it out. In the case that I want to make this function generic, how can I easily "duplicate" identical indexes on the new table `src`? I mean, is there a straightforward way to "select" indexes creation statements from `dest` and run it against `src`? And can that be done within a sql function? Thanks! – xiaolong Apr 15 '15 at 03:58
0

You could build a replication system, and the BI jobs will didnt affect to the main db server, this is a very common scenario on datawarehouse and BI. You keep a master server database, and on the other server configured as slave you can launch your (usually hard) bi queries and etls with real fresh data. but from another isolated source but with the exact same data.

http://www.rassoc.com/gregr/weblog/2013/02/16/zero-to-postgresql-streaming-replication-in-10-mins/

jacktrade
  • 3,125
  • 2
  • 36
  • 50