0

I have a Postgres DB running 7.4 (yeah, we're in the midst of upgrading).

I have four separate queries to get the Daily, Monthly, Yearly and Lifetime record counts:

SELECT COUNT(field)
FROM database
WHERE date_field
    BETWEEN DATE_TRUNC('DAY', LOCALTIMESTAMP)
    AND DATE_TRUNC('DAY', LOCALTIMESTAMP) + INTERVAL '1 DAY'

For the monthly count, just replace the word DAY with MONTH in the query, and so on for each time period.

I'm looking for ideas on how to get all the desired results with one query, plus any optimizations you'd recommend.

Thanks in advance!

NOTE: date_field is timestamp without time zone

UPDATE:

Sorry, I do filter out records with additional query constraints; I just wanted to give the gist of the date_field comparisons. Sorry for any confusion.

thirtydot
  • 224,678
  • 48
  • 389
  • 349
Phill Pafford
  • 83,471
  • 91
  • 263
  • 383
  • The most obvious answer is to join the selects of the other queries. It's still 4 queries, but you do it in one call which reduces transactional overhead. – corsiKa May 25 '11 at 16:33
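A sketch along the lines of that comment, though using a slightly different trick (conditional counts rather than four joined sub-selects, and not taken from the thread itself): the four counts can be folded into a single pass over the table, which also works on 7.4. It uses the placeholder table and column names from the question, and assumes date_field never holds future timestamps, so >= the truncated date is equivalent to the original BETWEEN windows.

SELECT COUNT(CASE WHEN date_field >= DATE_TRUNC('DAY',   LOCALTIMESTAMP) THEN field END) AS daily,
       COUNT(CASE WHEN date_field >= DATE_TRUNC('MONTH', LOCALTIMESTAMP) THEN field END) AS monthly,
       COUNT(CASE WHEN date_field >= DATE_TRUNC('YEAR',  LOCALTIMESTAMP) THEN field END) AS yearly,
       COUNT(field) AS lifetime
FROM database;

Note that this still has to read every row that qualifies for the lifetime count in one scan; whether that beats four separate, index-assisted queries is exactly the trade-off discussed in the answers below.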

3 Answers

1

One idea is to use prepared statements and a simple statistics table (record_count_t) for this:

-- DROP TABLE IF EXISTS record_count_t;
-- DEALLOCATE record_count;
-- DROP FUNCTION updateRecordCounts();

CREATE TABLE record_count_t (type text, count bigint);

-- One row per period; full unit names so the value can be passed straight to DATE_TRUNC.
-- (Separate INSERTs, since multi-row VALUES lists need 8.2+.)
INSERT INTO record_count_t (type) VALUES ('day');
INSERT INTO record_count_t (type) VALUES ('month');
INSERT INTO record_count_t (type) VALUES ('year');
INSERT INTO record_count_t (type) VALUES ('lifetime');

PREPARE record_count (text) AS
UPDATE record_count_t SET count =
    (SELECT COUNT(field)
     FROM database
     WHERE
     CASE WHEN $1 <> 'lifetime' THEN
         DATE_TRUNC($1, date_field) = DATE_TRUNC($1, LOCALTIMESTAMP)
     ELSE TRUE END)
WHERE type = $1;

CREATE FUNCTION updateRecordCounts() RETURNS void AS
$$
    EXECUTE record_count('day');
    EXECUTE record_count('month');
    EXECUTE record_count('year');
    EXECUTE record_count('lifetime');
$$
LANGUAGE SQL;

SELECT updateRecordCounts();
SELECT type, count FROM record_count_t;

Call the updateRecordCounts() function any time you need to update the statistics.

Grzegorz Szpetkowski
  • 36,988
  • 6
  • 90
  • 137
0

I'd guess that it is not possible to optimize this any further than it already is.

If you're collecting daily/monthly/yearly stats, as I'm assuming you are doing, one option (after upgrading, of course) is a with statement and the relevant joins, e.g.:

with daily_stats as (
    (what you posted)
),
monthly_stats as (
    (what you posted, for month)
),
etc.
select daily_stats.stats,
       monthly_stats.stats,
       etc.
from lifetime_stats
left join yearly_stats on ...
left join monthly_stats on ...
left join daily_stats on ...
However, that will actually perform less well than running each query separately in a production environment, because you'd be introducing left joins in the DB that could be done just as well in the middleware (i.e. show daily, then monthly, then yearly and finally lifetime stats), if not better, since you'll be avoiding full table scans.

By keeping things as is, you'll save precious DB resources for dealing with reads and writes on actual data. The trade-off (slightly less network traffic between your database and your app) is almost certainly not worth it.

Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
-1

Yikes! Don't do this!!! Not because you can't do what you're asking, but because you probably shouldn't be doing what you're asking in this manner. I'm guessing the reason you've got date_field in your example is because you've got a date_field attached to a user or some other meta-data.

Think about it: you are asking PostgreSQL to scan 100% of the records relevant to a given user. Unless this is a one-time operation, you almost assuredly do not want to do this. If this is a one-time operation and you are planning on caching the value as meta-data, then who cares about the optimizations? Space is cheap, and spending it here will save you heaps of execution time down the road.

You should add four per-user (or whatever the grouping is) meta-data fields that sum up the data. You have two options; I'll let you figure out how to adapt this so that you keep historical counts, but here's the easy version:

CREATE TABLE user_counts_only_keep_current (
  user_id INT NOT NULL,  -- your user_id; match the type of "user"(id)
  lifetime INT DEFAULT 0,
  yearly INT DEFAULT 0,
  monthly INT DEFAULT 0,
  daily INT DEFAULT 0,
  last_update_utc TIMESTAMP WITH TIME ZONE,
  FOREIGN KEY(user_id) REFERENCES "user"(id)
);
CREATE UNIQUE INDEX this_tbl_user_id_udx ON user_counts_only_keep_current(user_id);

Set up some stored procedures that zero out the individual columns whenever last_update_utc no longer matches the current day/month/year according to NOW(). You can get creative from here, but incrementing counters like this is going to be the way to go.
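
A rough sketch of what such a routine could look like (the function name bump_user_counts is hypothetical, it assumes the user_counts_only_keep_current table above, and it is written as modern PL/pgSQL rather than anything 7.4 would accept):

CREATE OR REPLACE FUNCTION bump_user_counts(p_user_id INT) RETURNS void AS
$$
BEGIN
    -- Reset any period counters whose period has rolled over since the last update.
    UPDATE user_counts_only_keep_current
       SET daily   = CASE WHEN DATE_TRUNC('day',   last_update_utc) <> DATE_TRUNC('day',   NOW()) THEN 0 ELSE daily   END,
           monthly = CASE WHEN DATE_TRUNC('month', last_update_utc) <> DATE_TRUNC('month', NOW()) THEN 0 ELSE monthly END,
           yearly  = CASE WHEN DATE_TRUNC('year',  last_update_utc) <> DATE_TRUNC('year',  NOW()) THEN 0 ELSE yearly  END
     WHERE user_id = p_user_id;

    -- Then bump every counter, including lifetime, and stamp the update time.
    UPDATE user_counts_only_keep_current
       SET daily = daily + 1,
           monthly = monthly + 1,
           yearly = yearly + 1,
           lifetime = lifetime + 1,
           last_update_utc = NOW()
     WHERE user_id = p_user_id;
END;
$$ LANGUAGE plpgsql;

Reading the four counters then becomes a single indexed lookup: SELECT daily, monthly, yearly, lifetime FROM user_counts_only_keep_current WHERE user_id = ...;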

Time-series data in any relational database requires special handling and maintenance. Look into PostgreSQL's table inheritance if you want good temporal data management... but really, don't do whatever it is you're about to do to your application, because it's almost certainly going to result in bad things(tm).

Sean
  • 9,888
  • 4
  • 40
  • 43
  • sry maybe I should have mentioned that this is not all the query, I do filter out a ton of records with other query conditions. Just wanted to state the basic query functionality – Phill Pafford May 25 '11 at 18:35
  • Big sweeping scans of date data like this attached to the words "performance" almost always mean the app is designed wrong. Use a piece of meta-data elsewhere that you increment and poll this data from. Really. :~] – Sean May 25 '11 at 18:38
  • "Think about it: you are asking PostgreSQL to scan 100% of the records relevant to a given user. Unless this is a one-time operation" -- not so. an index on date_field will handle this quite well, even with pg 7.4... – Denis de Bernardy May 25 '11 at 19:00
  • Grr... no. That's why I used the word `relevant`. I know Pg's smart enough to index scan only the dates specified in the qual, but it will cause 100% of the physical pages backing the INDEX and the TABLE to be scanned. In a cache-miss scenario, each tuple scanned requires the backing page to be `read(2)` in, which puts pressure on a database's cache (yay for 2Q caches!). – Sean May 25 '11 at 21:30