
I have been using partitioning with a PostgreSQL database for a while. My database has grown quite a lot, and partitioning has kept that growth manageable. Unfortunately, I now seem to have hit another speed barrier and am trying to figure out some ways to speed up the database even more.

My basic setup is as follows: I have one master table called database_data from which all the partitions inherit. I chose to have one partition per month, named like database_data_YYYY_MM, which works nicely.

By analyzing my data usage, I noticed that I mostly do insert operations on the table and only some updates. The updates, however, also occur only on a certain kind of row: I have a column called channel_id (an FK to another table). The rows I update always have a channel_id out of a set of maybe 50 IDs, so this would be a great way of distinguishing the rows that are never updated from the ones that potentially are.
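
For illustration, a typical update in my workload looks something like this (the channel ID and values here are made up):

UPDATE database_data
SET value = 42.0
WHERE channel_id = 17
  AND timestamp >= '2015-11-05 00:00:00+00'
  AND timestamp <  '2015-11-06 00:00:00+00';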

I figured it would speed up my setup further if I used the partitioning to have, per month, one table of insert-only data and one of potentially updated data, as my updates would have to check fewer rows each time.

I could of course keep the "simple" partitioning I am using now, add another table for each month called database_data_YYYY_MM_update, and put the special constraints on it and on the database_data_YYYY_MM table so the query planner can distinguish between the tables.
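
A sketch of that flat layout, assuming for the moment that channel 2 is the only updatable channel (in reality it would be the set of ~50 IDs):

CREATE TABLE database_data_2015_11 (
    CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE '2015-11-01 00:00:00+00' AND timestamp < TIMESTAMP WITH TIME ZONE '2015-12-01 00:00:00+00' ),
    CHECK ( channel_id <> 2 ),
    PRIMARY KEY (id)
) INHERITS (database_data);

CREATE TABLE database_data_2015_11_update (
    CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE '2015-11-01 00:00:00+00' AND timestamp < TIMESTAMP WITH TIME ZONE '2015-12-01 00:00:00+00' ),
    CHECK ( channel_id = 2 ),
    PRIMARY KEY (id)
) INHERITS (database_data);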

I was, however, thinking that I sometimes have operations which operate on all data of a given month, no matter whether it is updatable or not. In such a case I could combine the two tables with a UNION ALL, but there could be an easier way for such queries.
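
With the flat layout, such a whole-month query would need something like:

SELECT * FROM database_data_2015_11
UNION ALL
SELECT * FROM database_data_2015_11_update;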

So now to my real question:

Is "two layer" partitioning possible in PostgreSQL? What I mean by that is, that instead of having two tables for each month inheriting from the master table, I would only have one table per month directly inheriting from the master table e.g. database_data_YYYY_MM and then have two more tables inheriting from that table, one for the insert only data e.g. database_data_YYYY_MM_insert and one for the updateable data e.g. database_data_YYYY_MM_update.

Would this speed up the query planning at all? I would guess it would be faster if the query planner could eliminate both child tables at once whenever the intermediate table is eliminated.

The obvious advantage here would be that I could operate on all data of one month by simply using the table database_data_YYYY_MM, and for my updates use the child table directly.
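
Queries would then look something like this (names follow the scheme above):

-- Whole-month operation: just use the intermediate table
SELECT * FROM database_data_2015_11;

-- Updates: go straight to the updatable child
UPDATE database_data_2015_11_update
SET value = 42.0
WHERE channel_id = 2;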

Any drawbacks that I am not thinking of?

Thank you for your thoughts.


Edit 1:

I don't think a schema is strictly necessary to answer my question, but if it helps understanding, I'll provide a sample schema:

CREATE TABLE database_data (
    id bigint PRIMARY KEY,
    channel_id bigint,    -- This is a FK to another table
    timestamp TIMESTAMP WITH TIME ZONE,
    value DOUBLE PRECISION
);

I have a trigger on the database_data table that generates the partitions on demand:

CREATE OR REPLACE FUNCTION function_insert_database_data() RETURNS TRIGGER AS $BODY$
DECLARE
    thistablename TEXT;
    thisyear INTEGER;
    thismonth INTEGER;
    nextmonth INTEGER;
    nextyear INTEGER;
BEGIN
    -- determine year and month of timestamp
    thismonth = extract(month from NEW.timestamp AT TIME ZONE 'UTC');
    thisyear = extract(year from NEW.timestamp AT TIME ZONE 'UTC');

    -- determine next month for timespan in check constraint
    nextyear = thisyear;
    nextmonth = thismonth + 1;
    if (nextmonth >= 13) THEN
        nextmonth = nextmonth - 12;
        nextyear = nextyear +1;
    END IF;

    -- Assemble the table name, zero-padding the month to match the
    -- database_data_YYYY_MM naming convention
    thistablename = 'database_data_' || thisyear || '_' || lpad(thismonth::text, 2, '0');

    -- We loop until the insert is successful, to catch the case where another
    -- connection simultaneously creates the table; if that happens, we can
    -- simply retry inserting the data
    LOOP
        -- try to insert into table
        BEGIN
            EXECUTE 'INSERT INTO ' || quote_ident(thistablename) || ' SELECT ($1).*' USING NEW;
            -- Returning NEW also inserts the row into the master table, which lets INSERT statements return values as in "INSERT INTO ... RETURNING *"
            -- This requires another trigger that deletes the row from the master table again afterwards (see the cleanup trigger below)
            RETURN NEW;
        -- If the table does not exist, create it
        EXCEPTION
            WHEN UNDEFINED_TABLE THEN
                BEGIN
                    -- Create table with check constraint on timestamp
                    EXECUTE 'CREATE TABLE ' || quote_ident(thistablename) || ' (CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE '''|| thisyear || '-'|| thismonth ||'-01 00:00:00+00''
                        AND timestamp < TIMESTAMP WITH TIME ZONE '''|| nextyear || '-'|| nextmonth ||'-01 00:00:00+00'' ), PRIMARY KEY (id)
                        ) INHERITS (database_data)';
                    -- Add any trigger and indices to the table you might need
                    -- Insert the new data into the new table
                    EXECUTE 'INSERT INTO ' || quote_ident(thistablename) || ' SELECT ($1).*' USING NEW;
                    RETURN NEW;
                EXCEPTION WHEN DUPLICATE_TABLE THEN
                    -- another thread seems to have created the table already. Simply loop again.
                END;
            -- Don't insert anything on other errors
            WHEN OTHERS THEN
                RETURN NULL;
        END;
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER trigger_insert_database_data
BEFORE INSERT ON database_data
FOR EACH ROW EXECUTE PROCEDURE function_insert_database_data();
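
The cleanup trigger mentioned in the comments above looks roughly like this (a simplified sketch; it removes the copy of the row that RETURN NEW leaves in the master table):

CREATE OR REPLACE FUNCTION function_delete_database_data() RETURNS TRIGGER AS $BODY$
BEGIN
    -- ONLY restricts the delete to the master table itself, not the partitions
    DELETE FROM ONLY database_data WHERE id = NEW.id;
    RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER trigger_delete_database_data
AFTER INSERT ON database_data
FOR EACH ROW EXECUTE PROCEDURE function_delete_database_data();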

As for sample data: let's assume we only have two channels, 1 and 2. Channel 1 is insert-only data and channel 2 is updatable.

My two layer approach would be something like:

Main table:

CREATE TABLE database_data (
    id bigint PRIMARY KEY,
    channel_id bigint,    -- This is a FK to another table
    timestamp TIMESTAMP WITH TIME ZONE,
    value DOUBLE PRECISION
);

Intermediate table:

CREATE TABLE database_data_2015_11 (
    CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE '2015-11-01 00:00:00+00' AND timestamp < TIMESTAMP WITH TIME ZONE '2015-12-01 00:00:00+00' ),
    PRIMARY KEY (id)
) INHERITS (database_data);

Partitions:

CREATE TABLE database_data_2015_11_insert (
    CHECK ( channel_id = 1 ),
    PRIMARY KEY (id)
) INHERITS (database_data_2015_11);

CREATE TABLE database_data_2015_11_update (
    CHECK ( channel_id = 2 ),
    PRIMARY KEY (id)
) INHERITS (database_data_2015_11);

Of course I would then need another trigger on the intermediate table to create the child tables on demand.
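
To see whether the planner actually prunes as hoped, I would check with something like:

-- constraint_exclusion = partition is the default setting
SET constraint_exclusion = partition;

EXPLAIN SELECT *
FROM database_data
WHERE timestamp >= '2015-11-01 00:00:00+00'
  AND timestamp <  '2015-12-01 00:00:00+00'
  AND channel_id = 2;
-- Ideally this scans only database_data, database_data_2015_11
-- and database_data_2015_11_update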

Tim

1 Answer


It's a clever idea, but sadly it doesn't seem to work. If I have a parent table with 1000 direct children and I run a SELECT that should pull from just one child, EXPLAIN ANALYZE gives me a planning time of around 16 ms. On the other hand, if I have just 10 direct children, each of which has 10 children, and each of those has 10 children, I get a planning time of about 29 ms. I was surprised; I really thought it would work!
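
The numbers come from the planning-time line that EXPLAIN ANALYZE prints; the test query was along these lines (the group_id value is arbitrary):

EXPLAIN ANALYZE SELECT * FROM datapoints WHERE group_id = 42;
-- Planning time: ~16 ms with 1000 flat children, ~29 ms with the 10x10x10 tree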

Here is some Ruby code I used to generate my tables:

# Generate DDL for a three-level inheritance tree under a datapoints root
# table: 10 "c" tables covering 100 group_ids each, 10 "x" tables of 10
# group_ids under each of those, and one leaf table per group_id.
0.upto(999) do |i|
  if i % 100 == 0
    min_group_id = i
    max_group_id = min_group_id + 100
    puts "CREATE TABLE datapoints_#{i}c (check (group_id > #{min_group_id} and group_id <= #{max_group_id})) inherits (datapoints);"
  end
  if i % 10 == 0
    min_group_id = i
    max_group_id = min_group_id + 10
    puts "CREATE TABLE datapoints_#{i}x (check (group_id > #{min_group_id} and group_id <= #{max_group_id})) inherits (datapoints_#{i / 100 * 100}c);"
  end
  puts "CREATE TABLE datapoints_#{i + 1} (check (group_id = #{i + 1})) inherits (datapoints_#{i / 10 * 10}x);"
end
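
The script assumes the root table already exists; something like this (only group_id is required by the checks, the other columns are guesses):

CREATE TABLE datapoints (
    id bigint PRIMARY KEY,
    group_id integer,
    value double precision
);
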
Paul A Jungwirth