postgresql - are my indexes or column types slowing down my query?

Question

I have a table I created locally to use some of PG's window functions on a dataset that has about 4 million rows (originally a text file). Each row corresponds to a customer order.

CREATE TABLE orders
(
  orderid integer,
  customerid integer,
  orderdate date,
  status text,
  amount money,
  tax money,
  customername text,
  customerstate text

I have the database running locally in a i7 8gb RAM Windows 8 machine. I have btree indexes (indices?) on orderid, customerid and orderdate.

When I run the following query, it takes 300 seconds (appx). I was hoping with some basic tuning I could get it down to a minute, but I'm not a DBA. Anyone have tips?

select orderid, customername, orderdate, 
rank() OVER (PARTITION BY customername ORDER BY orderdate ASC) as cust_ord_nbr
from orders

why user customername as the partition key when you have a customerid in the table? — Daniel Vérité, Sep 12 '15 at 17:05
In that case, normalising customers into a separate table will not help (much) The only thing that could possibly help would then be a composite index on (customerid, orderdate) — wildplasser, Sep 12 '15 at 18:29
@wildplasser - thanks, that got me down to 140 seconds. Curious - is the speed due to my hardware? Further, is it CPU, RAM or some other thing? What if I had a datastore 10x this size - just 'wait'? — Jeffrey James, Sep 12 '15 at 18:39
Possibly just the sheer volume of the data: approx 4M * ~100 ~= 400MB of data. Try prefixing the command with `EXPLAIN ANALYZE` to get the time required for the query *without* the output. — wildplasser, Sep 12 '15 at 18:44
What's your use case? A human client wouldn't be interested in receiving 4 mio. rows at once. Are you writing to another table? I don't see any constraints, can you provide the exact table definition (what you get with `\d orders` in psql)? — Erwin Brandstetter, Sep 13 '15 at 03:32
@ErwinBrandstetter - no production use case, mainly data auditing prep before further transformation and analysis. — Jeffrey James, Sep 13 '15 at 03:37
Which columns do you *need* in the result? Do you need `customername` or would `customerid` do the job as well? Where do you send the result? You don't just query 4 mio rows, you have to put the result *somewhere*. Is it a read-only table? Are you the only user? Or do you have (concurrent) write access, too? Consider instructions for [postgresql-performance] questions: http://stackoverflow.com/tags/postgresql-performance/info — Erwin Brandstetter, Sep 13 '15 at 03:47

score 2 · Accepted Answer · edited May 23 '17 at 11:44

Covering index

Partition by customerid, like @Daniel commented. integer is smaller and cheaper to sort. If you don't need customername in the result, replace it with customerid completely.

A multicolumn index can help (like @wildplasser commented). If it's a (mostly) read-only table, a "covering" index that allows index-only scans would be even faster - especially if included columns are small:

CREATE INDEX orders_nbr_idx ON orders (customerid, orderdate, orderid);

Adding orderid to the index only makes sense if you get index-only scans out of it. If you need the customername, add that, too. More:

How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?

If it's (mostly) a read-only table, execute the expensive query once and save the snapshot as MATERIALIZED VIEW for reuse ...

Peanuts

You can do a couple of small things to reduce the memory footprint. After playing column tetris, this would save 0-7 bytes per row currently lost to padding:

CREATE TABLE orders (
  orderid integer,
  customerid integer,
  amount money,
  tax money,
  orderdate date,
  status text,
  customername text,
  customerstate text
  );

If you write the result to another table (or MATERIALIZED VIEW), it would save a bit to optimize the query in a similar fashion. rank() produces a bigint, by casting to int you save 8 bytes per row (4 + 4 padding):

SELECT orderid, customername, orderdate
    -- orderid, customerid, orderdate  -- good enough?
     , rank() OVER (PARTITION BY customerid
                    ORDER BY orderdate)::int AS cust_ord_nbr
FROM   orders;

Thanks for the tip on materialized view! Going to try that out and report back. — Jeffrey James, Sep 14 '15 at 23:33

postgresql - are my indexes or column types slowing down my query?

1 Answers1

Covering index

Peanuts