
I have the following tables in TimescaleDB:

Fact table:

v1.fact_program_event_audience_v1 (hypertable):

create table v1.fact_program_event_audience_v1
(
    event_id                           uuid,
    content_id                         text,
    date_of_transmission               date not null,
    station_code                       integer,
    start_time                         timestamp,
    end_time                           timestamp,
    duration_is_minutes                integer,
    program_type                       smallint,
    program_name                       text,
    ...
    audience_category_number           integer,
    live_viewing                       integer,
    consolidated_viewing               integer,
    live_tv_viewing_excluding_playback integer,
    consolidated_total_tv_viewing      integer
);

create index fact_program_event_audience_v1_date_of_transmission_idx
    on v1.fact_program_event_audience_v1 (date_of_transmission desc);

create index program_event_audience_v1_date_of_transmission_event_id_idx
    on v1.fact_program_event_audience_v1 (date_of_transmission, event_id);

Dimension tables:

v1.dimension_program_content (PostgreSQL table):

create table v1.dimension_program_content (
  content_id text not null,
  content_name text not null,
  episode_number numeric not null,
  episode_name text not null
);
create index dimension_program_content_content_id_idx on v1.dimension_program_content using btree (content_id);
    

v1.dimension_audience_category (PostgreSQL table):

create table v1.dimension_audience_category (
  category_number numeric not null,
  category_description text not null,
  target_size_in_hundreds numeric not null
);
create index dimension_audience_category_category_number on v1.dimension_audience_category using btree (category_number);

v1.dimension_db2_station (PostgreSQL table):

create table v1.dimension_db2_station (
  stationcode numeric,
  station15charname text,
  stationname text
);
create index dimension_db2_station_station_code on v1.dimension_db2_station using btree (stationcode);

The hypertable chunk interval is '1 day'.
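For reference, a 1-day chunk interval like this is normally set when the hypertable is created (a sketch only; the actual `create_hypertable` call used here is not shown in the question):

```sql
-- Convert the fact table into a hypertable partitioned by transmission date,
-- producing one chunk per day.
SELECT create_hypertable(
    'v1.fact_program_event_audience_v1',
    'date_of_transmission',
    chunk_time_interval => INTERVAL '1 day'
);
```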

Each day approximately 2,000,000 - 5,000,000 new rows arrive in the fact table. Each dimension table has at most 1,000 rows; dimension_audience_category has only 60 rows (this will be important later).

I have the following query:

SELECT
    event.start_time,
    event.event_id,
    event.db2_station_code,
    event.end_time,
    event.panel_code,
    event.duration_is_minutes,
    event.date_of_transmission,
    event.area_flags,
    event.barb_content_id,
    station.stationname,
    event."array",
    content.episode_name,
    content.episode_number,
    content.content_name
FROM
(
    SELECT
        event.date_of_transmission,
        event.event_id,
        event.start_time,
        event.end_time,
        event.db2_station_code,
        event.panel_code,
        event.duration_is_minutes,
        event.area_flags,
        event.barb_content_id,
        json_agg(json_build_object('d', category.category_description)) as "array"
    FROM v1.fact_program_event_audience_v1 event
    JOIN v1.dimension_audience_category category ON event.audience_category_number = category.category_number
    WHERE
        date_of_transmission = '2022-09-19T00:00:00Z'
    GROUP BY
        event.date_of_transmission,
        event.event_id,
        event.start_time,
        event.end_time,
        event.db2_station_code,
        event.panel_code,
        event.duration_is_minutes,
        event.area_flags,
        event.barb_content_id
    ORDER BY 
        date_of_transmission,
        event.event_id
    LIMIT 10000
) as event
JOIN v1.dimension_db2_station station ON station.stationcode = event.db2_station_code
JOIN v1.dimension_program_content content ON content.content_id = event.barb_content_id;

This query takes approximately 10 seconds on a cold cache (first run). From the query plan I can see why: there is a separate chunk per day, and the LIMIT does not seem to be applied inside the chunk's index scan, even though the data is pre-ordered by my index on (date_of_transmission, event_id). Why does it take so long just to select the first 10,000 rows inside a single chunk?

This is execution plan: https://pastebin.com/4VUZB0Fd


I can see the following buffer usage at the chunk Index Scan stage:

  1. Shared hit: 4.3 GB
  2. Shared read: 491 MB

The index does not seem ideal: the full chunk for the whole day (2022-09-19) is only 550 MB.
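The per-chunk size quoted above can be checked with TimescaleDB's size helpers (TimescaleDB 2.x):

```sql
-- List each chunk of the hypertable with its total on-disk size
-- (heap + indexes + TOAST), so the 550 MB figure can be verified.
SELECT chunk_name, pg_size_pretty(total_bytes) AS total_size
FROM chunks_detailed_size('v1.fact_program_event_audience_v1')
ORDER BY chunk_name;
```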

I also tried adding AND event_id > '<smallest event_id value in that day>' to the WHERE clause, but it doesn't seem to help at all.
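Concretely, the attempted predicate looked like this (the UUID below is a hypothetical placeholder for the day's smallest event_id, which the question does not give):

```sql
-- Attempted variant: constrain the index scan from below on event_id
-- in the hope of cheaper ordered retrieval within the chunk.
SELECT ...
FROM v1.fact_program_event_audience_v1 event
WHERE date_of_transmission = '2022-09-19'
  AND event_id > '00000000-0000-0000-0000-000000000000';
```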

I'm using TimescaleDB Cloud (16 GB RAM | 4 cores).

Thanks.

  • My first shot would be to drop the current indexes on fact_program_event_audience_v1 and create a new one covering all columns that you mention in the GROUP BY, in that order. Maybe also include audience_category_number – Frank Heikens Sep 21 '22 at 08:39
  • Thanks, I will try it now. I didn't do that because for the same event_id I have exactly the same values except audience_category_number. But I will try it and let you know. – Vlad Kuzmich Sep 21 '22 at 08:45
  • @FrankHeikens Still the same https://explain.tensor.ru/archive/explain/d180eee3eaf90941aada9b96686bea41:0:2022-09-21 – Vlad Kuzmich Sep 21 '22 at 08:59
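The covering index suggested in the comments could be sketched as follows (the index name and the exact column order are my assumptions; the columns follow the query's GROUP BY with audience_category_number appended so the aggregate can also be served from the index):

```sql
-- Multicolumn index matching the GROUP BY / ORDER BY of the query,
-- with the join key for the category dimension appended at the end.
CREATE INDEX fact_program_event_audience_v1_group_by_idx
    ON v1.fact_program_event_audience_v1 (
        date_of_transmission,
        event_id,
        start_time,
        end_time,
        db2_station_code,
        panel_code,
        duration_is_minutes,
        area_flags,
        barb_content_id,
        audience_category_number
    );
```

An alternative would be to keep only (date_of_transmission, event_id) as key columns and put the remaining columns in an INCLUDE clause, since they are not used for filtering or ordering.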

0 Answers