Problem statement
I have two databases provisioned via Heroku’s PostgreSQL database-as-a-service add-on, both running PostgreSQL 10.19. Heroku is deprecating PostgreSQL 10 later this year, so I plan to upgrade these databases to PostgreSQL 13 (Heroku’s default production version as of this writing).
During a similar PostgreSQL version upgrade in the past, queries that were performant on PostgreSQL 9 became orders of magnitude slower on PostgreSQL 13. I am trying to identify any such queries in advance of this upgrade.
Plan
I am attempting to identify slow queries before upgrading production via the following plan:
- Fork the production databases (they contain no PII) and point a dev environment at the forks
- Enumerate all queries in the application
- Figure out how to collect per-query performance data
- Determine how many times to run each query to gather meaningful data
- Collect baseline performance data
- Upgrade the forked database from PostgreSQL 10 to 13
- Collect post-upgrade performance data
- Compare the baseline and post-upgrade data to identify queries impacted by the upgrade (a rough sketch of how I might do this follows the list)
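For the "collect and compare performance data" steps, the approach I currently have in mind is to snapshot pg_stat_statements after running the same workload against the PostgreSQL 10 fork and the upgraded fork, then diff the two snapshots. A minimal sketch, assuming the pg_stat_statements extension is enabled on the fork, psycopg2 is installed, and DATABASE_URL points at the fork (note the timing columns are renamed in PostgreSQL 13: total_time/mean_time become total_exec_time/mean_exec_time):

```python
# snapshot_stats.py -- rough sketch only, not production code.
import csv
import os
import sys

import psycopg2

# pg_stat_statements timing columns were renamed in PostgreSQL 13.
QUERY_PG10 = """
    SELECT queryid, query, calls, total_time, mean_time
    FROM pg_stat_statements
    ORDER BY total_time DESC
"""
QUERY_PG13 = """
    SELECT queryid, query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
"""

def snapshot(out_path):
    """Dump per-query statistics to a CSV for later comparison."""
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW server_version_num")
            version = int(cur.fetchone()[0])
            cur.execute(QUERY_PG13 if version >= 130000 else QUERY_PG10)
            rows = cur.fetchall()
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["queryid", "query", "calls", "total_ms", "mean_ms"])
            writer.writerows(rows)
    finally:
        conn.close()

if __name__ == "__main__":
    snapshot(sys.argv[1])  # e.g. baseline.csv, then post_upgrade.csv
```

I would then join the two CSVs on the normalized query text rather than queryid, since my understanding is that queryid values are not guaranteed to be stable across major versions.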
Questions
- Does this seem like a reasonable plan?
- How would you suggest collecting performance data? I’ve considered writing a script that invokes all of the enumerated queries in the application with fuzzed inputs, then running that script some number of times. I could then use a service such as New Relic to aggregate the Postgres log output into a dashboard, examine PostgreSQL's statistics collector via introspection queries, or wrap each query in an EXPLAIN ANALYZE and capture the output (a rough sketch of that last option follows this list).
- Am I missing an obvious solution or product that would handle this for me?
- How would you determine how many times to run the above script to gather meaningful data? In other words, how do you determine an appropriate sample size?
- Is there an entirely separate approach that makes more sense?
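To make the EXPLAIN ANALYZE option above more concrete, this is roughly the kind of script I’m picturing. The query names, parameters, and run count are placeholders (the real list would come from enumerating the application's queries and fuzzing their inputs), and it again assumes psycopg2 plus a DATABASE_URL environment variable:

```python
# explain_capture.py -- rough sketch of the EXPLAIN ANALYZE approach.
import json
import os
import statistics

import psycopg2

# Placeholder queries standing in for the application's enumerated queries;
# the parameters would come from the input fuzzer.
QUERIES = {
    "orders_by_user": ("SELECT * FROM orders WHERE user_id = %s", (42,)),
}

RUNS_PER_QUERY = 20  # open question: what sample size is actually meaningful?

def time_query(cur, sql, params):
    """Run EXPLAIN ANALYZE and return the reported execution time in ms."""
    cur.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {sql}", params)
    plan = cur.fetchone()[0]
    if isinstance(plan, str):  # some driver versions return the JSON as text
        plan = json.loads(plan)
    return plan[0]["Execution Time"]

def main():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn.cursor() as cur:
        for name, (sql, params) in QUERIES.items():
            timings = [time_query(cur, sql, params) for _ in range(RUNS_PER_QUERY)]
            print(f"{name}: median {statistics.median(timings):.2f} ms, "
                  f"stddev {statistics.pstdev(timings):.2f} ms")
    # EXPLAIN ANALYZE actually executes the statement, so roll back
    # in case any of the enumerated queries write data.
    conn.rollback()
    conn.close()

if __name__ == "__main__":
    main()
```

Running the same script against the PostgreSQL 10 fork and the upgraded fork would give me two sets of timings (and, if I keep the full JSON plans, the plan changes themselves) to compare.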
If curious about the slow queries, please see the following SO post. In short, after the version upgrade Postgres chose a different query execution plan for reasons that were never clear to me. The resolution was to refactor the query; why that worked is still unclear to me and is outside the scope of this question.