I mainly use the R package dbplyr to interact with a PostgreSQL database. That works by "piping" operations which are then translated to SQL and executed in one query. This tends to result in many nested joins. I wonder how clever the planner is when it comes to resolving such rather verbose and un-optimized expressions. Is it even possible to write "bad queries" as long as they only use e.g. SELECT, WHERE and JOIN (and no functions, casting etc) and where the end result is the same? How would such a query look? For example, will the planner figure out which columns are needed in a hash join in order to reduce memory, even if the columns are not specified in that join but only at the end after a join involving, say, 6 tables?
For example, can I safely ignore:
- Join order
- When columns are selected
- When filters are applied
I've found a lot of info about how the planner calculates costs and chooses a path, but not so much on how it arrives at a "minimal form" of the query at the first place. EXPLAIN ANALYZE doesn't help because it doesn't show which columns end up being selected. I'm sure someone will be unhappy with this question due to being too vague. If so, please point me in the right direction :)
EDIT:
An example.
Here is how a typical query would look in R with dbplyr. "gene_annotations" have the columns "gene" and "annotation_term". "genemaps" have "genemap", "gene", "probe", "study". Here I want to get the gene and annotation associated to a probe.
tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps"), by = "gene") %>%
filter(probe == 1L) %>% select(gene, annotation_term)
This translates to:
SELECT "gene", "annotation_term"
FROM (SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term", "RHS"."genemap" AS "genemap", "RHS"."probe" AS "probe", "RHS"."study" AS "study"
FROM "gene_annotations" AS "LHS"
INNER JOIN "genemaps" AS "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "dbplyr_004"
WHERE ("probe" = 1)
Can I trust that this has the exact same performance as e.g. this expression (except for the time for parsing and analyzing the expression)?
tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps") %>%
filter(probe == 1L) %>% select(gene) , by = "gene")
SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term"
FROM "gene_annotations" AS "LHS"
INNER JOIN (SELECT "gene"
FROM "genemaps"
WHERE ("probe" = 1)) "RHS"
ON ("LHS"."gene" = "RHS"."gene")
The plan is the same in both cases:
Nested Loop (cost=0.86..72.09 rows=546 width=8)
-> Index Only Scan using genemaps_probe_index on genemaps (cost=0.43..2.16 rows=36 width=4)
Index Cond: (probe = 1)
-> Index Only Scan using gene_annotations_pkey on gene_annotations "LHS" (cost=0.43..1.79 rows=15 width=8)
Index Cond: (gene = genemaps.gene)
I didn't want to provide an example, because I don't have an issue with this specific query. I'm wondering is if I can always disregard these issues altogether and just piece together joins until I get the end result I want.
EDIT 2:
I found out that there is a VERBOSE option to EXPLAIN where you can see which columns are returned. For the small example above the plan was identical also in that regard. Still, can I assume that holds for all reasonably complex queries? This is an example of how my queries typically look. As you can see, the SQL dbplyr generates isn't very easy to read. Here it joins six tables after various SELECT/WHERE.
SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "LHS"."probe_name" AS "probe_name", "RHS"."factor_order" AS "factor_order"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "RHS"."probe_name" AS "probe_name"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "RHS"."gene_symbol" AS "gene_symbol"
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name", "value", "gene", "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."genemap" AS "genemap", "LHS"."annotation_term" AS "annotation_term", "LHS"."value" AS "value", "RHS"."gene" AS "gene", "RHS"."probe" AS "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "RHS"."genemap" AS "genemap", "RHS"."annotation_term" AS "annotation_term", "RHS"."value" AS "value"
FROM (SELECT *
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name"
FROM "sample_view") "dbplyr_031"
WHERE (270 = 270)) "LHS"
INNER JOIN "gene_measurements" AS "RHS"
ON ("LHS"."sample" = "RHS"."sample")
) "LHS"
INNER JOIN (SELECT "genemap", "gene", "probe"
FROM "genemaps"
WHERE ("gene" IN (54812) AND "study" = 270)) "RHS"
ON ("LHS"."genemap" = "RHS"."genemap")
) "dbplyr_032") "LHS"
INNER JOIN (SELECT "gene", "gene_symbol"
FROM "genes") "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "LHS"
INNER JOIN (SELECT "probe", "probe_name"
FROM "probes") "RHS"
ON ("LHS"."probe" = "RHS"."probe")
) "LHS"
INNER JOIN (SELECT "group", "annotation_term_value" AS "factor_order"
FROM (SELECT "LHS"."group" AS "group", "LHS"."annotation_term" AS "annotation_term", "RHS"."annotation_term_value" AS "annotation_term_value"
FROM "group_annotations" AS "LHS"
INNER JOIN (SELECT "annotation_term", "annotation_term_value"
FROM "annotation_terms"
WHERE ("annotation_type" = 111)) "RHS"
ON ("LHS"."annotation_term" = "RHS"."annotation_term")
) "dbplyr_033") "RHS"
ON ("LHS"."sample_group" = "RHS"."group")