
This is a question about PostgreSQL's EXPLAIN command. This command shows you how the optimizer will execute your SQL based on the data in your tables. We are not in prod yet, so all of our tables have ~100 rows or less. Is there a way to get EXPLAIN to tell me what the explain plan would look like if certain tables had millions of rows instead of tens of rows?

I could generate the data somehow, but then I'd have to clear it and wait for it to be created. If that's the only way, I'll accept that as an answer, though.
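For illustration, this is roughly the workaround I have in mind (table and column names here are just placeholders): generate fake rows inside a transaction, refresh the statistics, look at the plan, and roll back so nothing has to be cleaned up afterwards. It still costs the generation time, though:

-- Placeholder names: test_table / foo.
BEGIN;

-- Bulk-generate a few million fake rows.
INSERT INTO test_table (foo)
SELECT (random() * 1000)::int
FROM generate_series(1, 5000000);

-- Refresh planner statistics so the plan reflects the new row count.
ANALYZE test_table;

-- See which plan the optimizer would choose at this size.
EXPLAIN SELECT * FROM test_table WHERE foo = 5;

-- Discard the generated rows (and the statistics gathered on them).
ROLLBACK;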

Daniel Kaplan
  • That approach is doomed. Changing the row count considerably can cause entirely different plans to be chosen, and there is no good way to predict that. There is no way to evade testing with realistic amounts of data. That said, 20 million rows may be a good enough approximation to 100 million rows. But there is always the possibility that the approximation is not good enough. – Laurenz Albe Oct 15 '21 at 06:02
  • @LaurenzAlbe "Changing the row count considerably can cause entirely different plans to be chosen, and there is no good way to predict that." I can make an educated guess as to the ratio of data between the tables. – Daniel Kaplan Oct 15 '21 at 19:11
  • Sure, but that is not good enough. For example, a hash for the smaller table may fit into `work_mem`, so that you get a hash join, but for the larger table you will get a merge join, resulting in a completely different plan. – Laurenz Albe Oct 18 '21 at 05:45
  • @LaurenzAlbe Good to know. How do you learn how to read the explain output? As far as I can tell, the documentation of explain doesn't link to it. – Daniel Kaplan Oct 18 '21 at 05:48
  • There is a special chapter in the documentation that deals with reading `EXPLAIN` output. – Laurenz Albe Oct 18 '21 at 06:02
  • For others, here's a link to how to read explain plans: https://www.postgresql.org/docs/current/using-explain.html – Daniel Kaplan Oct 18 '21 at 16:31

1 Answer


I don't think so. PostgreSQL collects statistics about your tables that the optimizer uses to choose the best plan. These statistics are not simply about how many rows a table contains; they also depend on the values/data in it.

From the postgres documentation:

the query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans.

What does that mean? Suppose we have an indexed column called foo, without a unique constraint. Suppose you run the following simple query:

SELECT * FROM test_table WHERE foo = 5

PostgreSQL will have to choose between different scan types:

  • sequential scan
  • index scan
  • bitmap scan

It will choose the type of scan based on how many rows it thinks the query will retrieve. How does it know how many rows will be retrieved before running the query? From the statistics that it collects. These statistics are based on the VALUES/DATA inside your table. Suppose you have a table with 1 million rows and 90% of them have foo = 5. PostgreSQL may know that, because it collects statistics about the distribution of your data. So it will choose a sequential scan, because according to its cost model that scan is the cheapest one.
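If you are curious what the planner believes about a column, you can look at the pg_stats view after running ANALYZE. This is just a sketch, reusing the test_table/foo names from the example above:

-- Refresh the planner statistics for the table.
ANALYZE test_table;

-- Show what the planner knows about the foo column: the most common
-- values, how frequent they are, and the estimated number of distinct values.
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'test_table'
  AND attname = 'foo';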

In the end, it would not be enough to just generate data; you would have to generate values that represent reality (the data that you will have in the future).
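For example (only a sketch, with made-up numbers), if you expect that in production roughly 90% of the rows will have foo = 5, the generated data should reproduce that skew instead of being uniformly random:

-- Hypothetical generator that reproduces the expected skew:
-- about 90% of rows get foo = 5, the rest get other values.
INSERT INTO test_table (foo)
SELECT CASE
         WHEN random() < 0.9 THEN 5
         ELSE (random() * 1000)::int
       END
FROM generate_series(1, 1000000);

-- Recollect statistics so the planner sees the new distribution.
ANALYZE test_table;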

You can already build your database with some indexes (based on the queries you will run) so that you get good performance in production from the start. If that turns out not to be enough, you will have to tune your indexes after you go into production.
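For instance (again just a sketch with the hypothetical names from above), if you already know the application will filter on foo, you can create that index up front:

-- Hypothetical index for the query pattern shown above.
CREATE INDEX test_table_foo_idx ON test_table (foo);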

SGiux
  • You can set the statistics. Doesn't that provide a way to pull this off? – Daniel Kaplan Oct 15 '21 at 17:50
  • You can set the statistics, but you may encounter three big problems: 1) there are a lot of statistics that PostgreSQL uses to analyze your data, 2) you need to know what these statistics mean, and 3) once you know what they all mean, you have to set them. To set them you need to know the data you will store. It's a lot of (and, in my opinion, counterproductive) work. What if you set the statistics only to discover later that you were wrong? You can already build a well-performing db using indexes and tune it afterwards. You can collect data in dev first and deploy to prod later if you wish. – SGiux Oct 15 '21 at 20:23
  • Is your position that running explain before you have a lot of data is a pre-optimization? If so, I understand your POV and now agree. – Daniel Kaplan Oct 15 '21 at 22:40
  • Yes, we can call it pre-optimization, and it already gives you good results in terms of performance. – SGiux Oct 16 '21 at 06:13
  • That makes sense. Thanks for putting it in those terms. – Daniel Kaplan Oct 16 '21 at 22:39
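For reference, the manual intervention discussed in the comments above looks roughly like this (a sketch only, using the hypothetical test_table/foo names; whether it is worth doing is exactly what is being debated). You can raise the statistics target for a column, or override the planner's n_distinct estimate directly:

-- Collect a larger statistics sample for the foo column
-- (the default statistics target is 100).
ALTER TABLE test_table ALTER COLUMN foo SET STATISTICS 1000;

-- Override the planner's estimate of the number of distinct values
-- in foo instead of letting ANALYZE estimate it.
ALTER TABLE test_table ALTER COLUMN foo SET (n_distinct = 50);

-- The settings take effect the next time the table is analyzed.
ANALYZE test_table;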