
I have a table with a few million rows, where rows are quite often inserted and even more often fetched.

Time of row insertion is not critical, but the time of fetching is, because it serves a website. Therefore, I have already created an index, which made fetching much faster.

The queries are pretty simple and contain no JOINs.

The problem happens with SELECT queries. The same SELECT query will run every few seconds to check for new or updated rows once a user has performed a search. However, it is not uncommon for the SELECT query to run for 50 seconds the first time, while subsequent runs of the same query take less than 1 second.

That makes me think that the problem is not the SELECT statement itself, but something else.

The table is:

CREATE TABLE all_legs (
                carrier TEXT,
                dep_hub TEXT,
                arr_hub TEXT,
                dep_dt TIMESTAMP WITH TIME ZONE,
                arr_dt TIMESTAMP WITH TIME ZONE,
                price_ct INTEGER,
                ... 5 more cols ...,
                PRIMARY KEY (carrier, dep_hub, arr_hub, dep_dt, arr_dt, ...3 other cols...)
                )

INDEX is:

CREATE INDEX IF NOT EXISTS fetch_index ON all_legs(dep_dt, LEFT(dep_hub::text, 6), LEFT(arr_hub::text, 6));

SELECT query:

SELECT * FROM all_legs
                    WHERE dep_dt >= %s
                    AND dep_dt < %s
                    AND (LEFT(dep_hub::text, 6) = %s AND LEFT(arr_hub::text, 6) = %s)

Such a case does not always happen and is therefore difficult to replicate. Here is an EXPLAIN output from my local database, which has less data than the one on Heroku and actually runs quite fast:

Index Scan using tz_idx on all_legs  (cost=0.41..111184.33 rows=1 width=695) (actual time=128.100..136.690 rows=20 loops=1)
  Index Cond: (("left"(dep_hub, 6) = 'ES-PMI'::text) AND ("left"(arr_hub, 6) = 'ES-MAD'::text))
  Filter: ((dep_dt)::date = '2018-01-19'::date)
  Rows Removed by Filter: 271
Planning time: 3.798 ms
Execution time: 138.525 ms

Why is the first time much slower and how can I reduce the running time of the first query?

J0ANMM
  • I mean, I will run the same query more than once because I'm checking if there are new or updated rows. So the first time I run a new query it takes a long time; the following times are OK. I updated the question to clarify this. – J0ANMM Jan 15 '18 at 08:55
  • Whoever is voting to close, it would be kind if you explained why... – J0ANMM Jan 15 '18 at 09:01
  • "*Why is the first time much slower*" - because of caching done by Postgres and the file system –  Jan 15 '18 at 09:04
  • It is very common that the first run of a query takes much longer. The most common reason is that the necessary data blocks from the table or index are not cached in memory (either in the Linux cache or in PG shared buffers) and must first be read from disk. When you repeat the query, most of the data blocks are already cached, so disk I/O is very small or nonexistent. – JosMac Jan 15 '18 at 09:06
  • And how to make the first run faster? Would it help to split the table into several smaller ones? – J0ANMM Jan 15 '18 at 09:25
  • What values are you passing for `(LEFT(dep_hub::text, 6) = %s AND LEFT(arr_hub::text, 6) = %s)`? That looks like it could be done more simply with a `LIKE` statement. – Simon Brahan Jan 15 '18 at 09:36
  • @J0ANMM - Maybe partitioning can help, but it increases planning time with too many partitions. You could also add more RAM. – Pavel Stehule Jan 15 '18 at 09:36
  • @SimonBrahan I am passing strings such as `'ES-PMI'`. Would a `LIKE` statement be faster? – J0ANMM Jan 15 '18 at 09:38
  • Are you running your Postgres server on a real machine or on some kind of virtualization? – Roman Hocke Jan 15 '18 at 09:38
  • @PavelStehule As all rows have a date, I thought about partitioning by month. Each user search is for a specific date, so I would only need to fetch one table. Would it still take more RAM? – J0ANMM Jan 15 '18 at 09:39
  • @RomanHocke Postgres is run on Heroku. Although the `EXPLAIN` statement I showed is from my local DB. – J0ANMM Jan 15 '18 at 09:40
  • `carrier TEXT, dep_hub TEXT, arr_hub TEXT,` Why are these PK columns all text fields? What is their cardinality? Could you squeeze them out into separate table(s) and refer to them via surrogate keys? – joop Jan 15 '18 at 09:50
  • @joop I made it like this at the beginning to keep things as simple as possible. If performance improves significantly, there would be no problem squeezing them out into separate table(s) and referring to them via surrogate keys. There must be around 1000 different possible values for each of these fields. – J0ANMM Jan 15 '18 at 09:56
  • For a table with "a few million rows", these fields count as low-cardinality. I also have the impression that `dep_hub` and `arr_hub` refer to the same domain. – joop Jan 15 '18 at 10:00
  • @joop Yes, that is correct. So if I understood correctly, moving them to another table and referring to them with integers would notably speed up the `SELECT` queries? – J0ANMM Jan 15 '18 at 10:02
  • It will reduce the memory footprint of your table(s), notably the index(es) (which, at the moment, will probably be larger than the table itself) – joop Jan 15 '18 at 10:06
  • @joop My concern here is that right now I am selecting with one query several hubs at `dep_hub` and `arr_hub` that have the same 6 first characters. If I convert them to integers, that would not be possible anymore and I would have to do something like `...WHERE... dep_hub IN (list_of_hubs)...`. Would that not be slower? – J0ANMM Jan 15 '18 at 10:13
  • No. Because you have to search in a table of 1000 items, instead of millions of them. – joop Jan 15 '18 at 10:28
  • To make the first query faster, you need to fill the cache during startup. You can do that either by simply running a dummy query against the table or by using the [pg_prewarm](https://www.postgresql.org/docs/current/static/pgprewarm.html) extension (a usage sketch follows these comments). –  Jan 15 '18 at 10:30
  • @J0ANMM Who knows? Tests are necessary. – Pavel Stehule Jan 15 '18 at 11:08
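
As a concrete illustration of the warm-up suggestions in the comments, here is a minimal sketch, assuming the `pg_prewarm` contrib extension can be installed on the server (which may not be possible on every hosted plan). A plain dummy `SELECT` over the relevant rows achieves a similar, if less complete, effect:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load the table and the index used by the search query into shared buffers
-- so the first user-facing query does not have to pay for disk reads.
SELECT pg_prewarm('all_legs');
SELECT pg_prewarm('fetch_index');

-- Alternative without the extension: a throwaway query that touches the same
-- blocks the real query will need (it only warms the blocks it actually reads).
SELECT count(*) FROM all_legs WHERE dep_dt >= now() - interval '1 day';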

2 Answers


The LEFT functions in your query and index are probably adding unneeded complexity; you can use wildcard matching with a LIKE statement to get the same thing.

SELECT * FROM all_legs
WHERE dep_dt >= '2018-01-19'
AND dep_dt < '2018-01-20'
AND dep_hub LIKE 'ES-PMI%'
AND arr_hub LIKE 'ES-MAD%'

Add a % to the end of your last two parameters.

With this, you should also be able to speed up the query by removing the index involving the LEFT function, and just index the columns normally.
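For example, such an index could look like the sketch below. The index name is illustrative and not from the original answer; `text_pattern_ops` is one possible addition that makes `LIKE 'prefix%'` indexable when the database collation is not "C", but a plain index on the columns may already be what the answer intends:

-- Column order is a judgment call: putting the hub columns first lets the
-- planner use the LIKE prefix as the leading index condition; putting dep_dt
-- first favours the date range instead.
CREATE INDEX IF NOT EXISTS all_legs_hub_dt_idx
    ON all_legs (dep_hub text_pattern_ops, arr_hub text_pattern_ops, dep_dt);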

Simon Brahan
  • Example code for squeezing out a single column (dep_hub)
  • If your {dep_hub,arr_hub} both refer to the same domain, you'll have to change things a bit (a sketch of that variant follows the example code below)
  • you also have to redefine the Primary Key,
  • and [maybe] add some functional indexes on the squeeze-out table

    -- [empty] table to contain the "squeezed out" domain
CREATE TABLE dep_hub
    ( id SERIAL NOT NULL PRIMARY KEY
    ,  dep_hub varchar
    , UNIQUE (dep_hub)
    );

   -- This is done in the chained insert/update
-- INSERT INTO dep_hub(dep_hub)
-- SELECT DISTINCT dep_hub
-- FROM all_legs ;


    -- an index may speedup the final update
    -- (the index will be dropped automatically
    -- once the column is dropped)
CREATE INDEX ON all_legs (dep_hub);

    -- The original table needs a "link" to the new table
ALTER TABLE all_legs
    ADD column dep_hub_id INTEGER -- NOT NULL
    REFERENCES dep_hub(id)
    ;

    -- FK constraints are helped a lot by a supportive index.
CREATE INDEX all_legs_dep_hub_fk ON all_legs (dep_hub_id);

    -- Chained query to:
    -- * populate the domain table
    -- * initialize the FK column in the original table
WITH src AS (
    INSERT INTO dep_hub(dep_hub)
    SELECT DISTINCT a.dep_hub
    FROM all_legs a
    RETURNING *
    )
UPDATE all_legs dst
SET  dep_hub_id = src.id
FROM src
WHERE src.dep_hub = dst.dep_hub
    ;

    -- Now that we have the FK pointing to the new table,
    -- we can drop the redundant column.
ALTER TABLE all_legs DROP COLUMN dep_hub;
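
If, as discussed in the comments, dep_hub and arr_hub really do share one domain, a hedged variant of the same idea could use a single lookup table referenced twice. The table and column names below are illustrative, not part of the original answer:

    -- One shared lookup table for both departure and arrival hubs.
CREATE TABLE hub
    ( id    SERIAL NOT NULL PRIMARY KEY
    , code  TEXT NOT NULL UNIQUE
    );

ALTER TABLE all_legs
      ADD COLUMN dep_hub_id INTEGER REFERENCES hub(id)
    , ADD COLUMN arr_hub_id INTEGER REFERENCES hub(id)
    ;

    -- Populate the lookup table from both columns (UNION removes duplicates),
    -- then link the big table to it.
INSERT INTO hub(code)
SELECT dep_hub FROM all_legs
UNION
SELECT arr_hub FROM all_legs;

UPDATE all_legs a
SET   dep_hub_id = d.id
    , arr_hub_id = r.id
FROM hub d, hub r
WHERE d.code = a.dep_hub
  AND r.code = a.arr_hub;

    -- As in the single-column example, the text columns can be dropped afterwards.
    -- Note that the original PRIMARY KEY involving these columns is dropped with
    -- them and has to be redefined, as noted in the bullet list above.
ALTER TABLE all_legs DROP COLUMN dep_hub, DROP COLUMN arr_hub;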
joop
  • Thanks! I could afford to drop the complete table and create it again. Would that simplify things? dep_hub and arr_hub would refer to the same domain. – J0ANMM Jan 15 '18 at 10:39
  • A better strategy would be: do the conversion in-place, then rename the table, and create the new table by `select ... from old_table` (omitting the xxx_hub fields), put the PK and FKs on it, and finally drop the old table. (And maybe replace it with a view, joining the xxx_hub tables twice...) – joop Jan 15 '18 at 10:47
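
A minimal sketch of the view joop mentions, assuming the single shared `hub` lookup table from the variant sketched above (names are illustrative):

    -- Expose the original text column names again by joining the hub table twice.
CREATE VIEW all_legs_v AS
SELECT d.code AS dep_hub
     , r.code AS arr_hub
     , l.*
FROM all_legs l
JOIN hub d ON d.id = l.dep_hub_id
JOIN hub r ON r.id = l.arr_hub_id;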