Background Info
On my Postgres server I have a few tables that are constantly accessed by business intelligence applications so ideally they should stay available most of the time. The tables get flushed and reloaded by our ETL pipeline on a nightly basis (I know... due to some legacy setup I can't use incremental updates here). The loading process takes pretty long and is not bullet-proof stable yet.
To increase availability of the tables, I decided to use staging tables to load the downstream data from our ETL pipeline first, and if they get loaded successfully, copy the data over to actual production tables.
Here is a copy function I created for this purpose:
CREATE OR REPLACE FUNCTION guarded_copy(src text, dest text) RETURNS void AS
$$
DECLARE
c1 INTEGER;
c2 INTEGER;
BEGIN
EXECUTE FORMAT('SELECT COUNT(*) FROM %I', src) INTO c1;
EXECUTE FORMAT('SELECT COUNT(*) FROM %I', dest) INTO c2;
IF c1>=c2 THEN
EXECUTE FORMAT('TRUNCATE TABLE %I CASCADE;', dest);
EXECUTE FORMAT('INSERT INTO %I SELECT * FROM %I;', dest, src);
END IF;
END
$$
LANGUAGE plpgsql VOLATILE;
The idea is to truncate the dest
table and load it with data from the src
table, if the src
table(staging table) has more rows than the dest
table(actual production table). And this actually works.
Note that the actual production table(dest
) has constraints and indexes, while the staging table(src
) is configured with NO indexes or constraints to speed up the loading process from ETL.
Problem
The issue with my function above is that data copy can be very expensive because of the indexes and constraints on the dest
table.
Question
- What's a better way to achieve the same goal?
- I am thinking of dropping/disabling the index on
dest
before the data copy step and add it back right after. How to do this in a SQL function? - I am also thinking about swapping the 2 tables by renaming them, but this requires the indexes on one table being copied to the other. How can I do this inside a function?
EDIT 1
Postgres version:
PostgreSQL 9.2.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
Table constraints:
On table dest
I have (unique) primary key on column id
, and indexes on timestamp columns.
EDIT 2
This question and this question really helped. For option 3 above, I think the code below is close to what I want.
CREATE OR REPLACE FUNCTION guarded_swap(src text, dest text) RETURNS void AS
$$
DECLARE
c1 INTEGER;
c2 INTEGER;
_query TEXT;
BEGIN
EXECUTE FORMAT('SELECT COUNT(*) FROM %I', src) INTO c1;
EXECUTE FORMAT('SELECT COUNT(*) FROM %I', dest) INTO c2;
IF c1>=c2 THEN
-- create indexes in src table
FOR _query IN
SELECT FORMAT('%s;', REPLACE(pg_get_indexdef(ix.indexrelid), dest, src))
FROM pg_class t, pg_class i, pg_index ix
WHERE t.oid = ix.indrelid
AND i.oid = ix.indexrelid
AND t.relkind = 'r' and i.relkind = 'i'
AND t.oid= dest::regclass
ORDER BY
t.relname, i.relname
LOOP
EXECUTE _query;
END LOOP;
-- drop indexes in dest table
FOR _query IN
SELECT FORMAT('DROP INDEX %s;', i.relname)
FROM pg_class t, pg_class i, pg_index ix
WHERE t.oid = ix.indrelid
AND i.oid = ix.indexrelid
AND t.relkind = 'r' and i.relkind = 'i'
AND t.oid= dest::regclass
ORDER BY
t.relname, i.relname
LOOP
EXECUTE _query;
END LOOP;
-- create constraints in src table
FOR _query IN
SELECT
FORMAT ('ALTER TABLE %s ADD CONSTRAINT %s %s;', src,
REPLACE(conname, dest, src),
pg_get_constraintdef(oid))
FROM pg_constraint
WHERE contype = 'p' AND conrelid = dest::regclass
LOOP
EXECUTE _query;
END LOOP;
-- drop all constraints in dest table
FOR _query IN
SELECT
FORMAT ('ALTER TABLE %s DROP CONSTRAINT IF EXISTS %s;', dest, conname)
FROM pg_constraint
WHERE conrelid = dest::regclass
LOOP
EXECUTE _query;
END LOOP;
-- swap the table names
EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', dest, CONCAT(dest, '_old'));
EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', src, dest);
EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', CONCAT(dest, '_old'), src);
END IF;
END
$$
LANGUAGE plpgsql VOLATILE;
EDIT 3
Another thought: PKs and FKs might be unnecessary on tables that are only used for analytics purposes. So indexes are the only concern here.
CREATE OR REPLACE FUNCTION guarded_swap(src text, dest text) RETURNS void AS
$$
DECLARE
c1 INTEGER;
c2 INTEGER;
_idx_name TEXT;
_query TEXT;
_qs TEXT[];
BEGIN
EXECUTE FORMAT('SELECT COUNT(*) FROM %I', src) INTO c1;
EXECUTE FORMAT('SELECT COUNT(*) FROM %I', dest) INTO c2;
IF c1>=c2 THEN
-- drop indexes in dest table
FOR _idx_name, _query IN
SELECT i.relname, FORMAT('%s;', pg_get_indexdef(ix.indexrelid))
FROM pg_class t, pg_class i, pg_index ix
WHERE t.oid = ix.indrelid
AND i.oid = ix.indexrelid
AND t.relkind = 'r' and i.relkind = 'i'
AND t.oid= dest::regclass
ORDER BY
t.relname, i.relname
LOOP
_qs := array_append(_qs, _query);
EXECUTE FORMAT('DROP INDEX IF EXISTS %s;', _idx_name);
END LOOP;
-- swap the table names
EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', dest, CONCAT(dest, '_old'));
EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', src, dest);
EXECUTE FORMAT('ALTER TABLE %I RENAME TO %I;', CONCAT(dest, '_old'), src);
-- build indexes in dest table
FOREACH _query IN ARRAY _qs
LOOP
EXECUTE _query;
END LOOP;
END IF;
END
$$
LANGUAGE plpgsql VOLATILE;