Recently we updated the database structure of our multi-tenant Phoenix app to implement declarative partitioning for several tables (using PostgreSQL 14 on an AWS RDS instance). Following Ecto's guides on multi-tenancy, we were able to ensure that all frequently occurring queries use the tenant ID foreign key in WHERE
clauses, which is also the partition key and part of the composite primary key for the several child tables that have been partitioned.
We have confirmed that partition pruning is enabled, and since our queries filter by this static parition key we see that they are correctly only searching the relevant partition. We currently have just over 1,000 partitions for six tables.
The problem was that queries were regularly timing out due to not receiving a connection. We confirmed this was not a query processing timeout by the error reports: (DBConnection.ConnectionError) connection not available and request was dropped from queue after 801ms
. Also the same queries that would regularly take less than 100 ms would sometimes hang for several seconds (10 or more) before receiving this error.
We tried first following the advice from the error provided by DBConnection: ensured our database was generally available for connection, ensure that queries were not so slow as to hold all connections, adjusting the pool_size
, and increased the queue_target
. The level of benefit these solutions provided was indeterminate; they definitely did not solve the problem to an acceptable level.
We next tried to use AWS RDS Proxy to handle DB connection distribution. This does seem to have solved the problem to an acceptable level and tremendously decreased and connection timeout issues. Our main concern is that we don't know why this problem happened in the first place and therefore are not confident this solution is final. Our research into partitioning has not mentioned DB connection handling as a likely hurdle.
Does database connection handling become problematic based on the number of partitions and near our intended scale (up to 10,000)? What else could have lead to these connection issues, and why would RDS Proxy solve them compared to DBConnection's default handling (and should we feel confident it has solved them)?