0

I mainly use the R package dbplyr to interact with a PostgreSQL database. That works by "piping" operations which are then translated to SQL and executed in one query. This tends to result in many nested joins. I wonder how clever the planner is when it comes to resolving such rather verbose and un-optimized expressions. Is it even possible to write "bad queries" as long as they only use e.g. SELECT, WHERE and JOIN (and no functions, casting etc) and where the end result is the same? How would such a query look? For example, will the planner figure out which columns are needed in a hash join in order to reduce memory, even if the columns are not specified in that join but only at the end after a join involving, say, 6 tables?

For example, can I safely ignore:

  • Join order
  • When columns are selected
  • When filters are applied

I've found a lot of info about how the planner calculates costs and chooses a path, but not so much on how it arrives at a "minimal form" of the query at the first place. EXPLAIN ANALYZE doesn't help because it doesn't show which columns end up being selected. I'm sure someone will be unhappy with this question due to being too vague. If so, please point me in the right direction :)

EDIT:

An example.

Here is how a typical query would look in R with dbplyr. "gene_annotations" have the columns "gene" and "annotation_term". "genemaps" have "genemap", "gene", "probe", "study". Here I want to get the gene and annotation associated to a probe.

tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps"), by = "gene") %>%
filter(probe == 1L) %>% select(gene, annotation_term)

This translates to:

SELECT "gene", "annotation_term"
FROM (SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term", "RHS"."genemap" AS "genemap", "RHS"."probe" AS "probe", "RHS"."study" AS "study"
FROM "gene_annotations" AS "LHS"
INNER JOIN "genemaps" AS "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "dbplyr_004"
WHERE ("probe" = 1)

Can I trust that this has the exact same performance as e.g. this expression (except for the time for parsing and analyzing the expression)?

tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps") %>%
filter(probe == 1L) %>% select(gene) , by = "gene")
SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term"
FROM "gene_annotations" AS "LHS"
INNER JOIN (SELECT "gene"
FROM "genemaps"
WHERE ("probe" = 1)) "RHS"
ON ("LHS"."gene" = "RHS"."gene")

The plan is the same in both cases:

Nested Loop  (cost=0.86..72.09 rows=546 width=8)
  ->  Index Only Scan using genemaps_probe_index on genemaps  (cost=0.43..2.16 rows=36 width=4)
        Index Cond: (probe = 1)
  ->  Index Only Scan using gene_annotations_pkey on gene_annotations "LHS"  (cost=0.43..1.79 rows=15 width=8)
        Index Cond: (gene = genemaps.gene)

I didn't want to provide an example, because I don't have an issue with this specific query. I'm wondering is if I can always disregard these issues altogether and just piece together joins until I get the end result I want.

EDIT 2:

I found out that there is a VERBOSE option to EXPLAIN where you can see which columns are returned. For the small example above the plan was identical also in that regard. Still, can I assume that holds for all reasonably complex queries? This is an example of how my queries typically look. As you can see, the SQL dbplyr generates isn't very easy to read. Here it joins six tables after various SELECT/WHERE.

SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "LHS"."probe_name" AS "probe_name", "RHS"."factor_order" AS "factor_order"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "RHS"."probe_name" AS "probe_name"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "RHS"."gene_symbol" AS "gene_symbol"
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name", "value", "gene", "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."genemap" AS "genemap", "LHS"."annotation_term" AS "annotation_term", "LHS"."value" AS "value", "RHS"."gene" AS "gene", "RHS"."probe" AS "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "RHS"."genemap" AS "genemap", "RHS"."annotation_term" AS "annotation_term", "RHS"."value" AS "value"
FROM (SELECT *
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name"
FROM "sample_view") "dbplyr_031"
WHERE (270 = 270)) "LHS"
INNER JOIN "gene_measurements" AS "RHS"
ON ("LHS"."sample" = "RHS"."sample")
) "LHS"
INNER JOIN (SELECT "genemap", "gene", "probe"
FROM "genemaps"
WHERE ("gene" IN (54812) AND "study" = 270)) "RHS"
ON ("LHS"."genemap" = "RHS"."genemap")
) "dbplyr_032") "LHS"
INNER JOIN (SELECT "gene", "gene_symbol"
FROM "genes") "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "LHS"
INNER JOIN (SELECT "probe", "probe_name"
FROM "probes") "RHS"
ON ("LHS"."probe" = "RHS"."probe")
) "LHS"
INNER JOIN (SELECT "group", "annotation_term_value" AS "factor_order"
FROM (SELECT "LHS"."group" AS "group", "LHS"."annotation_term" AS "annotation_term", "RHS"."annotation_term_value" AS "annotation_term_value"
FROM "group_annotations" AS "LHS"
INNER JOIN (SELECT "annotation_term", "annotation_term_value"
FROM "annotation_terms"
WHERE ("annotation_type" = 111)) "RHS"
ON ("LHS"."annotation_term" = "RHS"."annotation_term")
) "dbplyr_033") "RHS"
ON ("LHS"."sample_group" = "RHS"."group")
Rasmus
  • 69
  • 7
  • I think you may get more traction with this one on https://dba.stackexchange.com/. There is some questions that need to be answered first though. Like: You say "even if the columns are not specificed in that join but only at the end after a join involving, say, 6 tables". What does this mean? Are you using `ON` clauses in your `FROM` clause? Are you defining table relationships in you `WHERE` clause? I would highly suggest sharing some of this SQL that is being generated as a starting point. – JNevill Jun 01 '20 at 13:26
  • Yes, it is pretty vague. Can you provide one concrete example? – jjanes Jun 01 '20 at 13:32
  • Added a small example. – Rasmus Jun 01 '20 at 14:03
  • Why don't you check the execution plan? –  Jun 01 '20 at 15:45
  • As I wrote in the first edit the plan was the same, but not all information was in the plan (then I discovered the VERBOSE option). Still, that doesn't mean that I can generalize and disregard join/filtering/select order is all cases. Which is what I want to know if I can :) – Rasmus Jun 01 '20 at 16:19
  • 1
    The join order stuff is quite well documented. See here https://www.postgresql.org/docs/current/explicit-joins.html – Endrju Jun 01 '20 at 20:31
  • Thanks, that's exactly the type of information I was looking for. Very helpful! – Rasmus Jun 02 '20 at 05:45

0 Answers0