
I've set up doobie + HikariCP with settings close to the defaults. With debug logging I can see how many connections are active/idle, and this matches what pg_stat_activity returns with this query:

SELECT
  pid,
  query,
  query_start,
  state
FROM pg_stat_activity;

HikariPool-1 - Pool stats (total=10, active=4, idle=6, waiting=0)

My app mainly processes messages from a stream by storing them in the db through a transaction that consists of 3 inserts and 2 selects before committing.

Below is how the transactor is created with HikariCP:

val hikariConfig = new HikariConfig()
hikariConfig.setJdbcUrl(dbUrl)
hikariConfig.setDriverClassName("org.postgresql.Driver")
hikariConfig.setUsername(dbUser)
hikariConfig.setPassword(dbPassword)
hikariConfig.setMaximumPoolSize(10)
hikariConfig.setAutoCommit(false) // doobie manages transaction demarcation itself
hikariConfig.addDataSourceProperty("socketTimeout", "30") // pgjdbc socket timeout, in seconds

val dataSource: HikariDataSource     = new HikariDataSource(hikariConfig)
val transactor: HikariTransactor[IO] = HikariTransactor[IO](dataSource)
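
For reference, the same numbers as the "HikariPool-1 - Pool stats" debug line can be read programmatically through HikariCP's pool MXBean (a minimal sketch; the println is just illustrative):

// Reads the same stats that the debug log line above reports.
val pool = dataSource.getHikariPoolMXBean
println(s"total=${pool.getTotalConnections}, active=${pool.getActiveConnections}, " +
  s"idle=${pool.getIdleConnections}, waiting=${pool.getThreadsAwaitingConnection}")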

This transactor is then passed around everywhere to handle database transactions:

import doobie._
import doobie.implicits._
import doobie.postgres.implicits._

val query = for {
  _ <- sql"insert into foo (id, value) values ($fooId, 'b')".update.run
  _ <- sql"insert into bar (id, value) values ($barId, 'b')".update.run
  _ <- sql"select * from bar where id = $barId".query[Bar].unique
  _ <- sql"insert into bazz (id, value1, value2) values ($bazzId, $fooId, $barId)".update.run
  _ <- sql"select * from bazz where id = $bazzId".query[Bazz].unique
} yield ()

query.transact(transactor).unsafeToFuture()
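
Each message goes through this pattern on its own, so the effect is one transaction, and one COMMIT, per message. A simplified sketch of the wiring (illustrative only; messages and insertMessage stand in for the real stream and the program above, which are not shown here):

// Illustrative: every stream element runs, and commits, its own transaction.
messages
  .evalMap(msg => insertMessage(msg).transact(transactor))
  .compile
  .drain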

The issue I'm seeing is that individual queries seem fast, but overall insert throughput is slow.

Some points I noticed:

  1. Checking pg_stat_activity, I see many COMMIT queries (more than inserts), each taking between ~90 ms and 200 ms.

    SELECT pid, now() - pg_stat_activity.query_start AS duration, query, query_start, state
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY duration DESC;

  2. The above pg_stat_activity query only returns 4-8 rows at a time, and the majority of the results have query = 'COMMIT'.

  3. Starting from an empty stream, throughput can go as high as 30k messages a minute. After 15-20 minutes, though, it drops to 3k messages a minute and does not recover until the stream has been empty for many hours; restarting the app or scaling out more instances of the app does not improve throughput.

Both the database's and the app's CPU (10%) and memory usage (16%) are low, so in theory throughput should be able to go much higher. What would be some areas worth investigating?

Currently using doobie 0.5.4 and doobie-hikari 0.5.4.

UPDATE: Raised the commit_delay and commit_siblings settings on the RDS database per Laurenz's suggestion, which prolonged the period of high throughput from 30 minutes to 60 minutes before the same prolonged decrease in throughput as before. I started testing a higher max_wal_size, but this seems to cause spikier throughput (at times very low), nowhere near the original startup rate.

irregular

1 Answer


If COMMIT takes long, that's a clear indication that your I/O system is overloaded.

If it is not the I/O volume itself that is the problem, it is probably the many WAL sync requests from the many small transactions: at 30k messages a minute, each committed individually, that is roughly 500 COMMITs, and therefore 500 WAL flushes, per second.

Some options:

  • Perform the work with fewer, bigger transactions (see the batching sketch after this list).

  • If you can afford to lose half a second of committed transactions in the event of a crash, set synchronous_commit to off (see the second sketch after this list).

  • If losing committed transactions is not an option, tune commit_delay and commit_siblings. That will make transactions take slightly longer, but it can reduce the I/O load.
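
For the first option, with doobie the per-message program from the question can be batched so that one transaction (and one COMMIT) covers a whole chunk of messages, for example via Update.updateMany. A minimal sketch, assuming a hypothetical Message shape and the question's table layout:

import cats.implicits._
import doobie._
import doobie.implicits._

// Hypothetical message shape; the real stream payload is not shown in the question.
case class Message(fooId: Int, barId: Int, bazzId: Int, value: String)

// One transaction, and therefore one WAL flush, per batch instead of per message.
def insertBatch(batch: List[Message]): ConnectionIO[Int] =
  for {
    f <- Update[(Int, String)]("insert into foo (id, value) values (?, ?)")
           .updateMany(batch.map(m => (m.fooId, m.value)))
    b <- Update[(Int, String)]("insert into bar (id, value) values (?, ?)")
           .updateMany(batch.map(m => (m.barId, m.value)))
    z <- Update[(Int, Int, Int)]("insert into bazz (id, value1, value2) values (?, ?, ?)")
           .updateMany(batch.map(m => (m.bazzId, m.fooId, m.barId)))
  } yield f + b + z

// insertBatch(batch).transact(transactor) then commits the whole batch at once.

For the second option, synchronous_commit does not have to be changed globally: it can be set per transaction with SET LOCAL, so only this write path risks losing its last few hundred milliseconds of commits after a crash (sketch; compose it at the start of the ConnectionIO program):

// Affects only the transaction it runs in.
val fastCommit: ConnectionIO[Unit] =
  sql"SET LOCAL synchronous_commit = off".update.run.void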

Laurenz Albe
  • So I am currently testing on an AWS RDS PostgreSQL "db.m5.xlarge". By I/O system, would that relate to the `Up to 10 Gbps` they have on the specs? I am wondering if upgrading to the `20 Gbps` is a possible solution (although I'd like to avoid this). I'll take a look at the other options and see how those go! https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html#Concepts.DBInstanceClass.Terminology – irregular Sep 20 '19 at 14:41
  • I *guess* that throughput is not your problem, it is I/O requests per second. Can you run `iostat -mNx 1` on the machine? – Laurenz Albe Sep 20 '19 at 14:49
  • Update: I think this answer is accurate, I raised both `commit_delay` and `commit_siblings` to the max (`100000` and `1000`) and saw an immediate increase in throughput back to the original values! Unfortunately the increase in throughput (or reduction in I/O load) only lasted for 60 minutes and then dropped similar to previously. I'll have to look into how to run the `iostat -mNx 1` on the AWS RDS machine. While looking at `commit_delay` and `commit_siblings`, I also saw `max_wal_size` and `min_wal_size`. Would it be wise to raise those values to the max as well? – irregular Sep 20 '19 at 16:59
  • Update: I attempted raising `max_wal_size` from 1024 -> 4096 and throughput actually dropped further – irregular Sep 20 '19 at 20:07
  • Raising `commit_delay` and `commit_siblings` to the maximum is useless. These parameters require deliberation and testing to tune them right. First, test if this is the problem area by checking if `synchronous_commit = off` improves the situation. Consider moving away from hosted databases. – Laurenz Albe Sep 21 '19 at 16:22
  • `synchronous_commit = off` improves the situation dramatically; although there is still a steady decline in throughput, the decline is many times slower. Looking at the postgres logs, `2019-09-23 15:17:49 UTC::@:[6685]:LOG: checkpoint complete: wrote 149086 buffers (30.3%); 0 WAL file(s) added, 0 removed, 44 recycled; write=56.535 s, sync=0.011 s, total=56.641 s; sync files=24, longest=0.006 s, average=0.000 s; distance=720894 kB, estimate=720914 kB` – irregular Sep 23 '19 at 16:09
  • Currently after 40 minutes it is `2019-09-23 15:59:00 UTC::@:[6685]:LOG: checkpoint complete: wrote 134906 buffers (27.4%); 0 WAL file(s) added, 0 removed, 44 recycled; write=79.212 s, sync=0.008 s, total=79.303 s; sync files=24, longest=0.004 s, average=0.000 s; distance=720861 kB, estimate=720991 kB` So over 40 minutes the throughput goes from 100% -> 70% and then after 70% there is a steep drop to 30% (of message processing throughput) – irregular Sep 23 '19 at 16:10
  • and in 5-minute intervals the throughput continues halving until it reaches 10% of the initial rate – irregular Sep 23 '19 at 16:17
  • That doesn't seem like PostgreSQL at all. Can you profile the application to see if the time is really spent in the database? – Laurenz Albe Sep 23 '19 at 16:22
  • I do see that most time is spent in the database, and scaling up the application doesn't result in improved throughput; even doubling the number of instances only takes throughput from 10% -> 12%. I'm not sure if these are the right logs, but the drops in throughput correspond with longer write times here: `2019-09-23 16:10:35 UTC::@:[6685]:LOG: checkpoint complete: wrote 134159 buffers (27.3%); 0 WAL file(s) added, 0 removed, 44 recycled; write=408.888 s, sync=0.325 s, total=409.758 s; sync files=24, longest=0.103 s, average=0.013 s; distance=720883 kB, estimate=720965 kB` – irregular Sep 25 '19 at 13:41
  • Hmm. That clearly looks like it is the I/O system's fault. PostgreSQL writes the same amount of data, but the writes are becoming slower. Perhaps there is a large write cache in the storage that gets filled? I'd ask the storage admin what is the matter. PostgreSQL doesn't seem to be at fault. – Laurenz Albe Sep 25 '19 at 14:00