I have the following tables in TimescaleDB.
Fact table:
v1.fact_program_event_audience_v1 (hypertable):
create table v1.fact_program_event_audience_v1
(
event_id uuid,
content_id text,
date_of_transmission date not null,
station_code integer,
start_time timestamp,
end_time timestamp,
duration_is_minutes integer,
program_type smallint,
program_name text,
...
audience_category_number integer,
live_viewing integer,
consolidated_viewing integer,
live_tv_viewing_excluding_playback integer,
consolidated_total_tv_viewing integer
);
create index fact_program_event_audience_v1_date_of_transmission_idx
on v1.fact_program_event_audience_v1 (date_of_transmission desc);
create index program_event_audience_v1_date_of_transmission_event_id_idx
on v1.fact_program_event_audience_v1 (date_of_transmission, event_id);
Dimension tables:
v1.dimension_program_content (PostgreSQL table):
create table v1.dimension_program_content (
content_id text not null,
content_name text not null,
episode_number numeric not null,
episode_name text not null
);
create index dimension_program_content_content_id_idx on v1.dimension_program_content using btree (content_id);
v1.dimension_audience_category (PostgreSQL table):
create table v1.dimension_audience_category (
category_number numeric not null,
category_description text not null,
target_size_in_hundreds numeric not null
);
create index dimension_audience_category_category_number on v1.dimension_audience_category using btree (category_number);
v1.dimension_db2_station (PostgreSQL table):
create table v1.dimension_db2_station (
stationcode numeric,
station15charname text,
stationname text
);
create index dimension_db2_station_station_code on v1.dimension_db2_station using btree (stationcode);
The hypertable chunk interval is '1 day'.
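For reference, the hypertable would have been created with something along these lines (a sketch; the exact call and options in my setup may differ):

SELECT create_hypertable(
    'v1.fact_program_event_audience_v1',   -- the fact table above
    'date_of_transmission',                -- time dimension column
    chunk_time_interval => INTERVAL '1 day'
);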
Each day approximately 2,000,000-5,000,000 new rows are added to the fact table. Each dimension table has at most 1,000 rows; dimension_audience_category has only 60 rows (this will be important later).
I have the following query:
SELECT
event.start_time,
event.event_id,
event.station_code,
event.end_time,
event.panel_code,
event.duration_is_minutes,
event.date_of_transmission,
event.area_flags,
event.content_id,
station.stationname,
event."array",
content.episode_name,
content.episode_number,
content.content_name
FROM
(
SELECT
event.date_of_transmission,
event.event_id,
event.start_time,
event.end_time,
event.station_code,
event.panel_code,
event.duration_is_minutes,
event.area_flags,
event.content_id,
json_agg(json_build_object('d', category.category_description)) as "array"
FROM v1.fact_program_event_audience_v1 event
JOIN v1.dimension_audience_category category ON event.audience_category_number = category.category_number
WHERE
date_of_transmission = '2022-09-19T00:00:00Z'
GROUP BY
event.date_of_transmission,
event.event_id,
event.start_time,
event.end_time,
event.station_code,
event.panel_code,
event.duration_is_minutes,
event.area_flags,
event.content_id
ORDER BY
date_of_transmission,
event.event_id
LIMIT 10000
) as event
JOIN v1.dimension_db2_station station ON station.stationcode = event.station_code
JOIN v1.dimension_program_content content ON content.content_id = event.content_id;
This query takes approximately 10 seconds on a cold cache (first run). In the query plan I can see why: there is a separate chunk per day, and the LIMIT does not cut the index scan short inside the chunk, even though the data is pre-ordered thanks to my index on (date_of_transmission, event_id). Why does it take so long just to select the first 10,000 rows inside a single chunk?
This is execution plan: https://pastebin.com/4VUZB0Fd
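The plan was captured with something like this (a sketch; the buffer figures below come from the BUFFERS option):

EXPLAIN (ANALYZE, BUFFERS)
SELECT ... ;  -- the full SELECT from above, unchanged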
In the buffers of the chunk's Index Scan node I can see:
- Shared hit: 4.3 GB
- Shared read: 491 MB
The index does not seem ideal: the full chunk is only 550 MB for the whole day (2022-09-19T00:00:00Z).
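For reference, per-chunk sizes can be checked with TimescaleDB's chunks_detailed_size function (a sketch):

SELECT chunk_name,
       pg_size_pretty(table_bytes) AS table_size,
       pg_size_pretty(index_bytes) AS index_size,
       pg_size_pretty(total_bytes) AS total_size
FROM chunks_detailed_size('v1.fact_program_event_audience_v1')
ORDER BY chunk_name;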
I also tried adding AND event_id > <smallest event_id value for that particular day> to the WHERE clause; it doesn't seem to help at all.
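That attempt looked roughly like this (a sketch; the lookup of the day's smallest event_id is shown as a separate hypothetical helper query):

-- Hypothetical helper: find the smallest event_id for the day,
-- using the (date_of_transmission, event_id) index.
SELECT event_id
FROM v1.fact_program_event_audience_v1
WHERE date_of_transmission = '2022-09-19T00:00:00Z'
ORDER BY event_id
LIMIT 1;

-- The inner query's WHERE clause then becomes:
-- WHERE date_of_transmission = '2022-09-19T00:00:00Z'
--   AND event_id > <the value found above>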
I'm using Timescale Cloud (16 GB RAM | 4 cores).
Thanks.