I had a similar case in a table with near 1M rows.
So I created a index b-tree on visited_at (datetime field) and tried
a query for all rows:
explain analyze select mes,count(usuario) as usuarios
from (
SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at), extract(month from visited_at)) AS mes
FROM pageview
) as usuarios
group by 1
order by 1
I got:
GroupAggregate (cost=445468.78..451913.54 rows=200 width=64) (actual time=31027.876..31609.754 rows=8 loops=1)
-> Sort (cost=445468.78..447616.37 rows=859035 width=64) (actual time=31013.501..31439.350 rows=358514 loops=1)
Sort Key: usuarios.mes
Sort Method: external merge Disk: 24000kB
-> Subquery Scan on usuarios (cost=247740.16..263906.75 rows=859035 width=64) (actual time=23121.403..28200.175 rows=358514 loops=1)
-> Unique (cost=247740.16..255316.40 rows=859035 width=48) (actual time=23121.400..28129.538 rows=358514 loops=1)
-> Sort (cost=247740.16..250265.57 rows=1010166 width=48) (actual time=23121.399..27559.241 rows=1010702 loops=1)
Sort Key: (COALESCE(pageview.usuario, (pageview.ip)::text)), (ROW(date_part('year'::text, pageview.visited_at), date_part('month'::text, pageview.visited_at)))
Sort Method: external merge Disk: 66944kB
-> Seq Scan on pageview (cost=0.00..84842.49 rows=1010166 width=48) (actual time=0.012..1909.324 rows=1010702 loops=1)
Total runtime: 31632.012 ms
That means no improvement over the query before the index.
But so I reduced the rows to current_date-31
explain analyze select mes,count(usuario) as usuarios
from (
SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at), extract(month from visited_at)) AS mes
FROM pageview
where visited_at > current_date - 31
) as usuarios
group by 1
order by 1
and got
-> Sort (cost=164735.62..165310.93 rows=230125 width=64) (actual time=9532.343..9602.743 rows=90871 loops=1)
Sort Key: usuarios.mes
Sort Method: external merge Disk: 5872kB
-> Subquery Scan on usuarios (cost=122598.79..126929.62 rows=230125 width=64) (actual time=7251.344..9178.901 rows=90871 loops=1)
-> Unique (cost=122598.79..124628.37 rows=230125 width=48) (actual time=7251.343..9157.837 rows=90871 loops=1)
-> Sort (cost=122598.79..123275.32 rows=270610 width=48) (actual time=7251.341..8932.541 rows=294915 loops=1)
Sort Key: (COALESCE(pageview.usuario, (pageview.ip)::text)), (ROW(date_part('year'::text, pageview.visited_at), date_part('month'::text, pageview.visited_at)))
Sort Method: external merge Disk: 18864kB
-> Bitmap Heap Scan on pageview (cost=5073.60..81528.85 rows=270610 width=48) (actual time=111.950..1877.603 rows=294915 loops=1)
Recheck Cond: (visited_at > (('now'::cstring)::date - 31))
Rows Removed by Index Recheck: 338268
-> Bitmap Index Scan on visited_at_index (cost=0.00..5005.94 rows=270610 width=0) (actual time=109.874..109.874 rows=294915 loops=1)
Index Cond: (visited_at > (('now'::cstring)::date - 31))
Total runtime: 9687.460 ms
I got a small improvement casting datetime to date (visited_at::date)
explain analyze select mes,count(usuario) as usuarios
from (
SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at::date), extract(month from visited_at::date)) AS mes
FROM pageview
where visited_at::date > current_date - 31
) as usuarios
group by 1
order by 1
and got
GroupAggregate (cost=201976.97..204126.56 rows=200 width=64) (actual time=9040.196..9102.098 rows=2 loops=1)
-> Sort (cost=201976.97..202692.83 rows=286345 width=64) (actual time=9035.624..9058.457 rows=88356 loops=1)
Sort Key: usuarios.mes
Sort Method: external merge Disk: 5704kB
-> Subquery Scan on usuarios (cost=149102.66..154491.53 rows=286345 width=64) (actual time=7511.231..8840.270 rows=88356 loops=1)
-> Unique (cost=149102.66..151628.08 rows=286345 width=48) (actual time=7511.229..8823.647 rows=88356 loops=1)
-> Sort (cost=149102.66..149944.47 rows=336722 width=48) (actual time=7511.227..8666.667 rows=287614 loops=1)
Sort Key: (COALESCE(pageview.usuario, (pageview.ip)::text)), (ROW(date_part('year'::text, ((pageview.visited_at)::date)::timestamp without time zone), date_part('month'::text, ((pageview.visited_at)::date)::timestamp without time zone)))
Sort Method: external merge Disk: 18408kB
-> Seq Scan on pageview (cost=0.00..97469.57 rows=336722 width=48) (actual time=0.018..1946.139 rows=287614 loops=1)
Filter: ((visited_at)::date > (('now'::cstring)::date - 31))
Rows Removed by Filter: 722937
Total runtime: 9108.644 ms
That are the tweaks that worked to me:
1) index b-tree (mainly)
2) casting to date (small difference)
10s still is a big time to response to user.
So my solution was create table month_users and use once
insert from month_users select mes,count(usuario) as usuarios
from (
SELECT distinct coalesce(usuario, ip) as usuario, (extract(year from visited_at), extract(month from visited_at)) AS mes
FROM pageview
) as usuarios
group by 1
order by 1
and use
select * from month_users
results:
Seq Scan on usuarios_mes (cost=0.00..21.30 rows=1130 width=42) (actual time=0.302..0.304 rows=8 loops=1)
Total runtime: 0.336 ms
A acceptable result now!
A final solution still have to consider how update the table results regularly.