
I'm working with PostgreSQL (I'm a rookie in the database world) and I'd like to know your opinion on the efficiency of this kind of query, which I found in the code I'm working with. These queries have a lot of JOINs, and one of them (marked `/* here */` below) produces many rows per request. This forces us to GROUP BY request.id in order to obtain one row per request, plus a field (also marked `/* here */`) with all of those rows' data.

I think this kind of query must waste a lot of time computing all these maximums, but I can't figure out an alternative way of doing this. Any ideas on its efficiency and how to improve it?

SELECT
  request.id AS id,
  max(request_type.name) AS request_type,
  to_char(max(request.timestamp),'DD/mm/YYYY HH24:mi') AS timestamp,
  to_char(max(request.timestamp),'YYYY-mm-DD') AS timestamp_filtering,
  max(state.name) AS request_state,
  max(users.name || ' ' || COALESCE(users.surname,'')) AS create_user,
  max(request.id_create_user) AS id_create_user,
  max(enterprise.name) AS enterprise,
  max(cause_issue.name) AS cause,
  max(request_movements.id_request_state) AS id_state,
  array_to_string(array_agg(DISTINCT act_code.name || '/' || req_res.act_code), ', ') AS act_code, /* here */
  max(revised.code) AS state_revised, 
  max(request_shipment.warehouse) AS warehouse,
  max(req_res.id_warehouse) AS id_warehouse
FROM
  request
  LEFT JOIN users
    ON users.id=request.id_create_user
  LEFT JOIN enterprise
    ON users.id_enterprise=enterprise.id
  LEFT JOIN request_movements
    ON request_movements.id=request.id_request_movement
  LEFT JOIN request_versions
    ON request_versions.id = request_movements.id_version
  LEFT JOIN state
    ON request_movements.id_request_state=state.id
  INNER JOIN request_type
    ON request.id_request_type=request_type.id
  LEFT JOIN cause_issue
    ON request.id_cause_issue=cause_issue.id
  LEFT JOIN request_reserve req_res
    ON req_res.id_request = request.id /* here */
  LEFT JOIN act_code
    ON req_res.id_act_code=act_code.id
  LEFT JOIN request_shipment
    ON (request_shipment.id_request=request.id)
  LEFT JOIN warehouse_enterprise
    ON (warehouse_enterprise.id = request_shipment.id_warehouse_enterprise)
  LEFT JOIN revised
    ON (revised.id = request_shipment.id_revised)
WHERE
  request.id_request_type = "any_type"  
GROUP BY
  request.id
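
For reference, the plan mentioned below can be obtained by prefixing the statement with EXPLAIN (EXPLAIN ANALYZE additionally executes the query and reports actual row counts and timings). A minimal sketch, trimmed to just the WHERE clause for brevity:

-- EXPLAIN alone shows the planner's estimates;
-- EXPLAIN ANALYZE executes the statement and adds actual timings.
EXPLAIN ANALYZE
SELECT request.id
FROM   request
WHERE  request.id_request_type = 'any_type';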

The EXPLAIN returns this.

  • Did you already run [EXPLAIN](http://www.postgresql.org/docs/8.4/static/sql-explain.html) on this query? – fvu Nov 13 '12 at 08:32
  • Yes, but as I said, I'm new to this and I haven't read enough to take advantage of this kind of analysis. – Pablo Novo Nov 13 '12 at 08:38
  • You may want to add the output of explain then, it helps people analyze your situation. – fvu Nov 13 '12 at 08:39
  • The best way to publish an execution plan is to upload it to http://explain.depesz.com –  Nov 13 '12 at 08:42
  • I edited in order to correct one thing in the query and add the EXPLAIN. – Pablo Novo Nov 13 '12 at 08:51
  • Info at http://stackoverflow.com/tags/postgresql-performance/info – Craig Ringer Nov 13 '12 at 08:58
  • That's a lot of left joins and aggregates o_O – d11wtq Nov 13 '12 at 11:31
  • Frankly, that's not that slow a plan for a query that joins that many tables with outer joins. The rowcount estimates are pretty reasonable, so you don't have statistics issues. This just looks like a big, complicated query. – Craig Ringer Nov 13 '12 at 13:33
  • One more thing. My real purpose was to get a faster alternative query. I think that having so many "maximums" is making the query slower, and in this case I don't need any maximum (an "any" aggregate function that returns the first value would be useful, to avoid searching for a maximum that is always the same!). – Pablo Novo Nov 13 '12 at 14:19
  • You question the cost of the max() calc, but the real cost is in all the left joins, mostly the seqscan+hash of "echo" and "oscar". The GroupAggregate step took only 2.5ms of 365ms total. I agree with @CraigRinger, this looks like a sensible plan for a complex join assembly. – dbenhur Nov 13 '12 at 22:14
  • @user1696783 `DISTINCT ON` can be useful when you want to just pick some arbitrary single value instead of using a well-defined aggregate. I'm not sure it'll actually be a performance win over `GROUP BY` with aggregates, but you could give it a go. – Craig Ringer Nov 14 '12 at 00:12
  • First of all, thank you all for your answers! I like the idea of using `DISTINCT ON` instead of GROUP BY. It lets me avoid about ten calls to the max() function. But I have a doubt: with this structure, is it possible to somehow access the data in the rows filtered out by the DISTINCT ON clause (see the sketch after these comments)? As you can see in the example query I wrote, I need to obtain one column of those rows and aggregate the results into a string. – Pablo Novo Nov 14 '12 at 07:43
  • This query could be much improved. But first I request that you post a valid query. What you posted is invalid: you refer to `request_envio`, which is not present in the query. Ping me in case you post something that actually works. (Test it before posting!) It's not OK to ask for help while not making the effort to supply a reliable base. Also, the only table with multiple rows in a join is `request_reserve`, correct? – Erwin Brandstetter Nov 14 '12 at 14:23
  • Yes, @ErwinBrandstetter, that's right. It is the only table with multiple rows, and I need data from all of them. That's why I aggregate the data with array_agg(). Sorry about the errors in the query. It is actually working, but I translated it to make it more understandable for everyone, and I made some mistakes (I have corrected them). Sorry again; it was not my intention to waste anyone's time. Thank you for your time anyway! :) – Pablo Novo Nov 14 '12 at 14:39
  • Thanks for your cooperation, much appreciated. – Erwin Brandstetter Nov 14 '12 at 16:20
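
For reference, a minimal sketch of the `DISTINCT ON` approach discussed in the comments above, assuming only the request and request_reserve tables with the columns shown in the question. It keeps one arbitrary reserve row per request, which also illustrates the limitation raised above: the discarded rows' act_code values are no longer reachable.

-- One row per request.id; the first row in ORDER BY wins.
-- Values from the other reserve rows are discarded, not aggregated.
SELECT DISTINCT ON (r.id)
       r.id, rr.act_code, rr.id_warehouse
FROM   request r
LEFT   JOIN request_reserve rr ON rr.id_request = r.id
ORDER  BY r.id, rr.act_code;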

1 Answer


You can simplify this query a lot by aggregating values in request_reserve and act_code before you join the result to the big query. This removes the need for aggregate functions on all the other columns and should generally be much faster for a larger number of rows.

SELECT r.id
      ,rt.name AS request_type
      ,to_char(r.timestamp, 'DD/mm/YYYY HH24:mi') AS timestamp
      ,to_char(r.timestamp, 'YYYY-mm-DD') AS timestamp_filtering
      ,s.name AS request_state
      ,u.name || COALESCE(' ' || u.surname, '') AS create_user
      ,r.id_create_user
      ,e.name AS enterprise
      ,c.name AS cause
      ,rm.id_request_state AS id_state
      ,rr.act_code
      ,rd.code AS state_revised
      ,rs.warehouse
      ,rr.id_warehouse
FROM      request              r
LEFT JOIN users                u  ON u.id = r.id_create_user
LEFT JOIN enterprise           e  ON e.id = u.id_enterprise
LEFT JOIN request_movements    rm ON rm.id = r.id_request_movement
LEFT JOIN request_versions     rv ON rv.id = rm.id_version
LEFT JOIN state                s  ON s.id = rm.id_request_state
     JOIN request_type         rt ON rt.id = r.id_request_type
LEFT JOIN cause_issue          c  ON c.id = r.id_cause_issue
LEFT JOIN request_shipment     rs ON rs.id_request = r.id
LEFT JOIN warehouse_enterprise w  ON w.id = rs.id_warehouse_enterprise
LEFT JOIN revised              rd ON rd.id = rs.id_revised
LEFT JOIN (
   SELECT rr.id_request, rr.id_warehouse
         ,array_to_string(array_agg(
             DISTINCT a.name || '/' || rr.act_code), ', ') AS act_code
   FROM   request_reserve rr
   LEFT   JOIN act_code   a ON rr.id_act_code = a.id
   GROUP  BY rr.id_request, rr.id_warehouse
   )  rr ON rr.id_request = r.id
WHERE  r.id_request_type = "any_type";  -- use single quotes for values!

For big queries it is essential that you have a format the human eye can easily parse. Therefore I reformatted before I improved the query. I use table aliases to avoid unwieldy identifiers as much as possible.

Minor improvement to create_user: no trailing space. If either part of the name can be NULL, I suggest this to avoid a dangling space:

COALESCE(u.name || ' ' || u.surname, u.name, u.surname)
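
To see the three cases quickly (hypothetical sample values, not from the question's schema):

SELECT COALESCE(name || ' ' || surname, name, surname) AS create_user
FROM  (VALUES ('Ada', 'Lovelace'), ('Ada', NULL), (NULL, 'Lovelace')) t(name, surname);
-- returns 'Ada Lovelace', 'Ada', 'Lovelace'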

In PostgreSQL 9.1 or later you could use concat_ws().
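
concat_ws() skips NULL arguments, so it achieves the same thing without the COALESCE chain; a sketch against the same users table (the one difference: it yields an empty string rather than NULL when both parts are NULL):

SELECT concat_ws(' ', u.name, u.surname) AS create_user
FROM   users u;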

  • Ohh!! Amazing! Thank you very much, Erwin! I'll try it tomorrow and give feedback about the results for everybody to see. – Pablo Novo Nov 14 '12 at 17:30
  • I am a little suspicious about `request_shipment`, which looks like it could have n rows per `request`. But you say it's not like that … – Erwin Brandstetter Nov 14 '12 at 18:01
  • And I'm right, trust me ;) Just as right as you were in your diagnosis. Excellent work!! The query now takes just over half the time it was taking before. Thank you very much for your time. It was very useful! – Pablo Novo Nov 15 '12 at 11:52