
I am working with an installation of PostgreSQL 11.2 that periodically complains in its system logs:

FATAL:  sorry, too many clients already

despite being nowhere close to its configured limit of connections. This query:

SELECT current_setting('max_connections') AS max,
       COUNT(*) AS total
FROM pg_stat_activity

tells me that the database is configured for a maximum of 100 connections. With this query I have never seen more than about 45 connections to the database, not even moments before a running program receives a "too many clients" database error, backed up by the above message in the Postgres logs.

Absolutely everything I can find about this issue on the Internet suggests that the error means you have exceeded the max_connections setting, but the database itself tells me that I have not.

For what it's worth, pyspark is the only database client that triggers this error, and only when it's writing into tables from dataframes. The regular Python code using psycopg2 (which is the main client) never triggers it (not even when writing into tables in the same manner from Pandas dataframes), and admin tools like pgAdmin also never trigger it. If I didn't see the error in the database logs directly, I would think that Spark is lying to me about the error. Most of the time, if I use a query like this:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity 
WHERE pid <> pg_backend_pid() AND application_name LIKE 'pgAdmin%';

then the problem goes away for several days. But like I said, I've never seen even 50% of the supposed max of 100 connections in use, according to the database itself. How do I figure out what is causing this error?

  • Have you tried setting the `log_connections` and `log_disconnections` parameters to capture the connections in the PostgreSQL server log file? You could pull this information out and analyse it to see where the connections are coming from; it may highlight the issue. – smbennett1974 Feb 13 '21 at 16:43
  • You can go from 45 connections to 101 connections in a fraction of a second. That you have never seen it in progress doesn't provide much evidence. – jjanes Feb 13 '21 at 18:50
  • @blackbishop I am not currently using `numPartitions` when writing from pyspark. I am writing to the database with `df.write.jdbc(...)` Should a value for `numPartitions` have any connection to the spark configuration `spark.cores.max` setting? – Eddie Feb 13 '21 at 21:11
  • @smbennett1974 - Good idea, I'll log connections and disconnections for the next time this crops up. The normal apps connecting to this database use connection pools, so won't make more than about 30 connections in total. Spark makes its own connections, however. Is Spark likely to make dozens of connections to a database in a fraction of a second from a `df.write.jdbc(...)` call? – Eddie Feb 13 '21 at 21:13
  • @Eddie You can try to set `numPartitions` less than 100. [sql-data-sources-jdbc](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html): "_This also determines the maximum number of concurrent JDBC connections_". – blackbishop Feb 13 '21 at 21:40
  • @blackbishop - For now I will try adding `numPartitions=10` and see if that makes a difference. It'll probably be another week before I can trigger this problem as it is intermittent. I will also enable connection and disconnection logging. Thanks for the help! – Eddie Feb 13 '21 at 21:43
  • @blackbishop - Oh my, I added a logging statement with `df.rdd.getNumPartitions()` before the code in question and it logged 200. Does this mean it will try to open 200 database connections in parallel? If so, why does this ever work, given that it normally does? – Eddie Feb 14 '21 at 01:30

2 Answers


Three possibilities:

  1. The connections are very short-lived, and they were already gone by the time you looked.

  2. You have a lower connection limit on that database.

  3. You have a lower connection limit on the database user.

But options 2 and 3 would result in a different error message, so it must be the short-lived connections.

Whatever it is, the answer to your problem would be a well-configured connection pool.
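
A minimal sketch of how you might catch such short-lived connection spikes is to sample pg_stat_activity at a short interval and log which application_name the connections belong to whenever the count climbs. The connection string, the 50% threshold, and the 0.2-second interval below are arbitrary placeholders, not something from the question:

import time
import psycopg2

# Placeholder DSN; point this at the database that logs the FATAL errors.
conn = psycopg2.connect("dbname=mydb user=dbuser host=dbhost")
conn.autocommit = True  # don't hold a transaction open between samples
cur = conn.cursor()

while True:
    cur.execute("SELECT count(*), current_setting('max_connections')::int"
                " FROM pg_stat_activity")
    total, max_conn = cur.fetchone()
    if total > max_conn * 0.5:  # arbitrary alert threshold
        # Break the spike down by client application and backend state.
        cur.execute("""
            SELECT application_name, state, count(*)
            FROM pg_stat_activity
            GROUP BY application_name, state
            ORDER BY count(*) DESC
        """)
        print(time.strftime("%H:%M:%S"), total, cur.fetchall())
    time.sleep(0.2)

Enabling log_connections and log_disconnections, as suggested in the comments, gives the same information from the server side.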

Laurenz Albe
  • It must be the short-lived ones. The other two would word the message differently. (Unless his client library intercepts and re-words it) – jjanes Feb 13 '21 at 18:04
  • This query returns no rows: `SELECT rolname, rolconnlimit FROM pg_roles WHERE rolconnlimit <> -1;` Checking in pgAdmin, the database connection limit is set to -1. Can Spark create short-lived connections, over 50 in less than two seconds? The main applications using this DB use a connection pool. Spark seems to be the problem. – Eddie Feb 13 '21 at 21:25

This is caused by how Spark reads/writes data using JDBC. Spark tries to open several concurrent connections to the database in order to read/write multiple partitions of data in parallel.

I couldn't find it in the docs, but I think by default the number of connections is equal to the number of partitions in the dataframe you want to write into the database table. This explains the intermittency you've noticed.

However, you can control this number by setting the `numPartitions` option:

The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.

Example:

spark.read.format("jdbc") \
     .option("numPartitions", "20")
     # ... other options (url, dbtable, user, password), then .load()
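
For the write path described in the question, a minimal sketch (the URL, table name, and credentials are placeholders) of capping the partition count so that the write opens at most that many concurrent JDBC connections:

# Placeholders for connection details.
url = "jdbc:postgresql://dbhost:5432/mydb"
properties = {"user": "dbuser", "password": "dbpass",
              "driver": "org.postgresql.Driver"}

# Either coalesce explicitly before writing...
df.coalesce(10).write.jdbc(url, "public.target_table",
                           mode="append", properties=properties)

# ...or let the numPartitions option do the coalesce, as the quoted docs describe:
df.write.format("jdbc") \
    .option("url", url) \
    .option("dbtable", "public.target_table") \
    .option("user", "dbuser") \
    .option("password", "dbpass") \
    .option("numPartitions", "10") \
    .mode("append") \
    .save()

Whatever cap you choose needs to leave headroom under max_connections for the application's psycopg2 connection pool and admin tools that also connect.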
blackbishop