1

I have a table I created locally to use some of PG's window functions on a dataset that has about 4 million rows (originally a text file). Each row corresponds to a customer order.

CREATE TABLE orders
(
  orderid integer,
  customerid integer,
  orderdate date,
  status text,
  amount money,
  tax money,
  customername text,
  customerstate text

I have the database running locally in a i7 8gb RAM Windows 8 machine. I have btree indexes (indices?) on orderid, customerid and orderdate.

When I run the following query, it takes 300 seconds (appx). I was hoping with some basic tuning I could get it down to a minute, but I'm not a DBA. Anyone have tips?

select orderid, customername, orderdate, 
rank() OVER (PARTITION BY customername ORDER BY orderdate ASC) as cust_ord_nbr
from orders
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Jeffrey James
  • 225
  • 4
  • 10
  • why user customername as the partition key when you have a customerid in the table? – Daniel Vérité Sep 12 '15 at 17:05
  • @DanielVérité - down to 198 seconds now :-) – Jeffrey James Sep 12 '15 at 17:18
  • How many different customers are there ? – wildplasser Sep 12 '15 at 17:47
  • @wildplasser - 3.3 million – Jeffrey James Sep 12 '15 at 18:18
  • In that case, normalising customers into a separate table will not help (much) The only thing that could possibly help would then be a composite index on (customerid, orderdate) – wildplasser Sep 12 '15 at 18:29
  • @wildplasser - thanks, that got me down to 140 seconds. Curious - is the speed due to my hardware? Further, is it CPU, RAM or some other thing? What if I had a datastore 10x this size - just 'wait'? – Jeffrey James Sep 12 '15 at 18:39
  • Possibly just the sheer volume of the data: approx 4M * ~100 ~= 400MB of data. Try prefixing the command with `EXPLAIN ANALYZE` to get the time required for the query *without* the output. – wildplasser Sep 12 '15 at 18:44
  • What's your use case? A human client wouldn't be interested in receiving 4 mio. rows at once. Are you writing to another table? I don't see any constraints, can you provide the exact table definition (what you get with `\d orders` in psql)? – Erwin Brandstetter Sep 13 '15 at 03:32
  • @ErwinBrandstetter - no production use case, mainly data auditing prep before further transformation and analysis. – Jeffrey James Sep 13 '15 at 03:37
  • Which columns do you *need* in the result? Do you need `customername` or would `customerid` do the job as well? Where do you send the result? You don't just query 4 mio rows, you have to put the result *somewhere*. Is it a read-only table? Are you the only user? Or do you have (concurrent) write access, too? Consider instructions for [postgresql-performance] questions: http://stackoverflow.com/tags/postgresql-performance/info – Erwin Brandstetter Sep 13 '15 at 03:47

1 Answers1

2

Covering index

Partition by customerid, like @Daniel commented. integer is smaller and cheaper to sort. If you don't need customername in the result, replace it with customerid completely.

A multicolumn index can help (like @wildplasser commented). If it's a (mostly) read-only table, a "covering" index that allows index-only scans would be even faster - especially if included columns are small:

CREATE INDEX orders_nbr_idx ON orders (customerid, orderdate, orderid);

Adding orderid to the index only makes sense if you get index-only scans out of it. If you need the customername, add that, too. More:

If it's (mostly) a read-only table, execute the expensive query once and save the snapshot as MATERIALIZED VIEW for reuse ...

Peanuts

You can do a couple of small things to reduce the memory footprint. After playing column tetris, this would save 0-7 bytes per row currently lost to padding:

CREATE TABLE orders (
  orderid integer,
  customerid integer,
  amount money,
  tax money,
  orderdate date,
  status text,
  customername text,
  customerstate text
  );

If you write the result to another table (or MATERIALIZED VIEW), it would save a bit to optimize the query in a similar fashion. rank() produces a bigint, by casting to int you save 8 bytes per row (4 + 4 padding):

SELECT orderid, customername, orderdate
    -- orderid, customerid, orderdate  -- good enough?
     , rank() OVER (PARTITION BY customerid
                    ORDER BY orderdate)::int AS cust_ord_nbr
FROM   orders;
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228