I have been using partitioning with a postgreSQL database for a while. My database has grown quite a lot and does so nicely with partitioning. Unfortunately I now seem to have hit another barrier in speed and am trying to figure out some ways to speed up the database even more.
My basic setup is as follows:
I have one master table called database_data
from which all the partitions inherit. I chose to have one partition per month and name them like: database_data_YYYY_MM
which works nicely.
By analyzing my data usage, I noticed, that I mostly do insert operations on the table and only some updates. The updates, however also occur on only a certain kind of row: I have a column called channel_id
(a FK to another table). The rows I update always have a channel_id
out of a set of maybe 50 IDs, so this would be a great way of distinguishing the rows that are never updated from the ones that potentially are.
I figured it would speed up my setup further if I would use the partitioning to have one table of insert only data and one of potentially updated data per month, as my updates would have to check less rows each time.
I could of course use the "simple" partitioning I am using now and add another table for each month called database_data_YYYY_MM_update
and add the special constraints to that and the database_data_YYYY_MM
table in order for the query planner to distinguish between the tables.
I was, however thinking, that I do sometimes have operations which operate on all data of a given month, no matter if updateable or not. In such a case I could JOIN the two tables but there could be an easier way for such queries.
So now to my real question:
Is "two layer" partitioning possible in PostgreSQL? What I mean by that is, that instead of having two tables for each month inheriting from the master table, I would only have one table per month directly inheriting from the master table e.g. database_data_YYYY_MM
and then have two more tables inheriting from that table, one for the insert only data e.g. database_data_YYYY_MM_insert
and one for the updateable data e.g. database_data_YYYY_MM_update
.
Would this speed up the query planning at all? I would guess that it would be faster if the query planner could eliminate both tables at once if the intermediate table was eliminated.
The obvious advantage here would be that I could operate on all data of one month by simply using the table database_data_YYYY_MM
and for my updates use the child table directly.
Any drawbacks that I am not thinking of?
Thank you for your thoughts.
Edit 1:
I don't think a schema is strictly necessary to answer my question but if it helps understanding I'll provide a sample schema:
CREATE TABLE database_data (
id bigint PRIMARY KEY,
channel_id bigint, -- This is a FK to another table
timestamp TIMESTAMP WITH TIME ZONE,
value DOUBLE PRECISION
)
I have a trigger on the database_data table that generates the partitions on demand:
CREATE OR REPLACE FUNCTION function_insert_database_data() RETURNS TRIGGER AS $BODY$
DECLARE
thistablename TEXT;
thisyear INTEGER;
thismonth INTEGER;
nextmonth INTEGER;
nextyear INTEGER;
BEGIN
-- determine year and month of timestamp
thismonth = extract(month from NEW.timestamp AT TIME ZONE 'UTC');
thisyear = extract(year from NEW.timestamp AT TIME ZONE 'UTC');
-- determine next month for timespan in check constraint
nextyear = thisyear;
nextmonth = thismonth + 1;
if (nextmonth >= 13) THEN
nextmonth = nextmonth - 12;
nextyear = nextyear +1;
END IF;
-- Assemble the tablename
thistablename = 'database_datanew_' || thisyear || '_' || thismonth;
-- We are looping until it's successfull to catch the case when another connection simultaneously creates the table
-- if that would be the case, we can retry inserting the data
LOOP
-- try to insert into table
BEGIN
EXECUTE 'INSERT INTO ' || quote_ident(thistablename) || ' SELECT ($1).*' USING NEW;
-- Return NEW inserts the data into the main table allowing insert statements to return the values like "INSERT INTO ... RETURNING *"
-- This requires us to use another trigger to delete the data again afterwards
RETURN NEW;
-- If the table does not exist, create it
EXCEPTION
WHEN UNDEFINED_TABLE THEN
BEGIN
-- Create table with check constraint on timestamp
EXECUTE 'CREATE TABLE ' || thistablename || ' (CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE '''|| thisyear || '-'|| thismonth ||'-01 00:00:00+00''
AND timestamp < TIMESTAMP WITH TIME ZONE '''|| nextyear || '-'|| nextmonth ||'-01 00:00:00+00'' ), PRIMARY KEY (id)
) INHERITS (database_data)';
-- Add any trigger and indices to the table you might need
-- Insert the new data into the new table
EXECUTE 'INSERT INTO ' || quote_ident(thistablename) || ' SELECT ($1).*' USING NEW;
RETURN NEW;
EXCEPTION WHEN DUPLICATE_TABLE THEN
-- another thread seems to have created the table already. Simply loop again.
END;
-- Don't insert anything on other errors
WHEN OTHERS THEN
RETURN NULL;
END;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
CREATE TRIGGER trigger_insert_database_data
BEFORE INSERT ON database_data
FOR EACH ROW EXECUTE PROCEDURE function_insert_database_data();
As for sample data: Let's assume we only have two channels: 1 and 2. 1 is insert only data and 2 is updateable.
My two layer approach would be something like:
Main table:
CREATE TABLE database_data (
id bigint PRIMARY KEY,
channel_id bigint, -- This is a FK to another table
timestamp TIMESTAMP WITH TIME ZONE,
value DOUBLE PRECISION
)
Intermediate table:
CREATE TABLE database_data_2015_11 (
(CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE '2015-11-01 00:00:00+00' AND timestamp < TIMESTAMP WITH TIME ZONE '2015-12-01 00:00:00+00)),
PRIMARY KEY (id)
) INHERITS(database_data);
Partitions:
CREATE TABLE database_data_2015_11_insert (
(CHECK (channel_id = 1)),
PRIMARY KEY (id)
) INHERITS(database_data_2015_11);
CREATE TABLE database_data_2015_11_update (
(CHECK (channel_id = 2)),
PRIMARY KEY (id)
) INHERITS(database_data_2015_11);
Of course I would then need another trigger on the intermediate table to create the child tables on demand.