
Problem

72 child tables, each having a year index and a station index, are defined as follows (one child table shown):

CREATE TABLE climate.measurement_12_013
(
-- Inherited from table climate.measurement:  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
-- Inherited from table climate.measurement:  station_id integer NOT NULL,
-- Inherited from table climate.measurement:  taken date NOT NULL,
-- Inherited from table climate.measurement:  amount numeric(8,2) NOT NULL,
-- Inherited from table climate.measurement:  category_id smallint NOT NULL,
-- Inherited from table climate.measurement:  flag character varying(1) NOT NULL DEFAULT ' '::character varying,
  CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
  CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
)
INHERITS (climate.measurement);

CREATE INDEX measurement_12_013_s_idx
  ON climate.measurement_12_013
  USING btree
  (station_id);
CREATE INDEX measurement_12_013_y_idx
  ON climate.measurement_12_013
  USING btree
  (date_part('year'::text, taken));

(Foreign key constraints to be added later.)
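For context, here is the parent table implied by the inherited column comments above (a reconstruction from those comments, not necessarily the exact original DDL):

CREATE TABLE climate.measurement
(
  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
  station_id integer NOT NULL,
  taken date NOT NULL,
  amount numeric(8,2) NOT NULL,
  category_id smallint NOT NULL,
  flag character varying(1) NOT NULL DEFAULT ' '::character varying
);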

The following query runs abysmally slowly due to a full table scan:

SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id IN (
    SELECT
      s.id
    FROM
      climate.station s,
      climate.city c
    WHERE
        /* For one city... */
        c.id = 5182 AND

        /* Where stations are within an elevation range... */
        s.elevation BETWEEN 0 AND 3000 AND

        /* and within a specific radius... */
        6371.009 * SQRT( 
          POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
            (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
              POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
        ) <= 50
    ) AND

  /* Data before 1900 is shaky; insufficient after 2009. */
  extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

  /* Whittled down by category... */
  m.category_id = 1 AND

  /* Between the selected days and years... */
  m.taken BETWEEN
    /* Start date. */
    (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
       into the next year. If it does, then add 1 to the current year.
    */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
    ) AS text)||'-12-31')::date
GROUP BY
  extract( YEAR FROM m.taken )

The sluggishness comes from this part of the query:

  m.taken BETWEEN
    /* Start date. */
    (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
       into the next year. If it does, then add 1 to the current year.
    */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
    ) AS text)||'-12-31')::date

This part of the query matches a selection of days. For example, if the user wants to look at data between June 1st and July 1st over all the years for which there is data, the clause matches against just those days. If the user wants to look at data between December 22nd and March 22nd, again across all years for which there is data, the clause works out that March 22nd falls in the year after December 22nd, and matches the dates accordingly.

Currently the dates are fixed at Jan 1 to Dec 31, but they will be parameterized, as described above.
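Since every date falls between Jan 1 and Dec 31 of its own year, the BETWEEN expression above is currently true for every row, so for these defaults the clause collapses into the year filter. A hedged illustration of the resulting constant-range form (equivalent only for the unparameterized case; station subquery omitted for brevity):

SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.category_id = 1 AND
  m.taken BETWEEN '1900-01-01'::date AND '2009-12-31'::date
GROUP BY
  extract( YEAR FROM m.taken );

With constant bounds, the planner can compare the range against indexes and partition constraints instead of computing an expression per row.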

The HashAggregate from the plan shows a cost of 10006220141.11, which is, I suspect, on the astronomically huge side.

A full table scan is being performed on the measurement table (which itself holds no data and has no indexes); the parent aggregates 273 million rows from its child tables.

Question

What is the proper way to index the dates to avoid full table scans?

Options I have considered:

  • GIN
  • GiST
  • Rewrite the WHERE clause
  • Add separate year_taken, month_taken, and day_taken columns to the tables

What are your thoughts?

Thank you!

Dave Jarvis
  • And what does this sluggish part do? I cannot understand it. – Tometzky May 20 '10 at 12:02
  • @Tometzky: I have updated the question with a picture showing the parameters for the query. – Dave Jarvis May 20 '10 at 15:12
  • You say your tables are partitioned by year and station, but your constraints don't match - the planner may not prune appropriately. Not to mention, if you're crossing all partitions by running it that wide open, then your costs go way up. Reducing your partitions or changing them *may* help (see the pruning sketch after this thread). – rfusca May 20 '10 at 19:15
  • @rfusca: The partitions (I hope) are not the issue; the problem is that the full table scan is being performed because the planner is not able to compare calculated dates (from strings) to real dates. Thus it invokes a full table scan. One thought I've had is to pass in two sets of dates. For `Dec 22 to Mar 22`, the query would look at `Dec 22 to Dec 31` of the current year AND `Jan 1 to Mar 22` for the year after the current year. – Dave Jarvis May 20 '10 at 20:02
  • What I'm saying is that yes, that's true, but you're also likely doing 72 table scans, aren't you? – rfusca May 21 '10 at 13:13
  • @rfusca: Yes, 72 table scans. I was thinking about splitting the tables just by category and adding year/month/day columns. – Dave Jarvis May 21 '10 at 15:03
  • If you don't have a column that is 99% of the time going to be queried, you may not have a case to partition by it. If category is almost always queried, then it sounds like a winner. The query over 72 partitions will take more time than a query over 1 partition with an equal total number of rows - especially if its complex in its plan. – rfusca May 21 '10 at 15:14
  • @rfusca: Category is always queried (it does not make sense to combine categories for this problem domain). Also, the district in which stations are queried will almost always be unique. There are 79 districts, which equates to 500k rows per table using 474 partitions. – Dave Jarvis May 21 '10 at 17:01
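A hedged sketch of the pruning point raised in this thread: with the existing CHECK constraints, constraint exclusion can only skip child tables when the query repeats the constrained expressions (this assumes the planner can match the expression, including the cast):

SET constraint_exclusion = on;

-- Repeats the CHECK expressions from measurement_12_013 verbatim, so the
-- planner can, in principle, exclude the other 71 children.
SELECT count(1) AS measurements, avg(m.amount) AS amount
FROM climate.measurement m
WHERE m.category_id = 7
  AND date_part('month', m.taken)::integer = 12;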

3 Answers


Your problem is that you have a WHERE clause that depends on a calculation of the date. There is no way the database can use an index if it needs to fetch every row and perform a calculation on it before knowing whether the date will match.

Unless you rewrite the clause so that the database has a fixed range to check, one that does not depend on the data it is retrieving, you will always have to scan the table.
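For instance, following the two-range idea from the question's comments (dates shown for a single 1950/51 season; the application, or a set-returning function, would presumably emit one such pair per year):

-- Hedged sketch: for Dec 22 to Mar 22, match Dec 22 .. Dec 31 of one year
-- plus Jan 1 .. Mar 22 of the next, using only constant date ranges.
SELECT count(1) AS measurements, avg(m.amount) AS amount
FROM climate.measurement m
WHERE m.category_id = 1
  AND (   m.taken BETWEEN '1950-12-22' AND '1950-12-31'
       OR m.taken BETWEEN '1951-01-01' AND '1951-03-22' );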

Cobusve
  • @Cobusve: I was under the impression that GIN and GiST can avoid this. Also, I am looking for the best solution. A full table scan is not an option; there are nearly 300 million rows! :-) MySQL could execute the query in 5 seconds; PostgreSQL has yet to finish. – Dave Jarvis May 20 '10 at 15:49

Try something like this:

-- Fifty years of daily dates to test against.
create temporary table test (d date);

insert into test select '1970-01-01'::date + generate_series(1, 50*365);

analyze test;

-- Immutable, so it can back an expression index: encodes month and day
-- as a single integer (e.g. April 5 becomes 405).
create function month_day(d date) returns int as $$
  select extract(month from $1)::int*100 + extract(day from $1)::int $$
language sql immutable strict;

create index test_d_month_day_idx on test (month_day(d));

-- Using the same expression in the WHERE clause lets the planner use
-- the index rather than scanning every row.
explain analyze select * from test
  where month_day(d) >= month_day('2000-04-01')
  and month_day(d) <= month_day('2000-04-05');
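For the December-to-March case in the question, the selection wraps past the year boundary, so the range would presumably be split into two conditions joined by OR; a hedged variant of the same approach:

-- Wrap-around selection (Dec 22 .. Mar 22): a date matches if its
-- month_day is late enough in the year or early enough in the next,
-- and each arm of the OR can use the same expression index.
explain analyze select * from test
  where month_day(d) >= month_day('2000-12-22')
  or month_day(d) <= month_day('2000-03-22');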
Tometzky
  • It was sluggish because the tables were not ordered properly. I fixed this by (1) reinserting the data in order of primary key, and (2) adding a CLUSTERed index. – Dave Jarvis Jun 01 '10 at 02:21

I think that to run this efficiently across those partitions, I would have your app be a lot smarter about the date ranges: have it generate an actual list of dates to check per partition, and then have it generate one query with a UNION between the partitions. It sounds like your data set is pretty static, so a CLUSTER on your date index could greatly improve performance as well (a sketch of the CLUSTER step follows).
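A hedged sketch of that CLUSTER step, using one child table and the expression index from the question (the CLUSTER ... USING syntax is PostgreSQL 8.3+, and this assumes CLUSTER accepts the expression index):

-- Physically reorder one partition to match the year index, then refresh
-- the statistics; the same pair of commands would be repeated per child.
CLUSTER climate.measurement_12_013 USING measurement_12_013_y_idx;
ANALYZE climate.measurement_12_013;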

rfusca
  • I've thought about this; I could have the application generate part of the WHERE clause, but then I'd be tightly coupling two completely different systems: the website and the report engine. – Dave Jarvis May 21 '10 at 15:02
  • You could move the dynamic generation of the where clause into a pl/pgsql function, and then you could just have it build the query; it's just a "select * from my_proc(parameters..)" - no different than before. – rfusca May 21 '10 at 15:08
  • That's an interesting proposition as well. Once I get the basic query working (and fast), I'll look into this more. Thank you! – Dave Jarvis May 21 '10 at 17:02