
We are having some problems with a PostgreSQL configuration. After some benchmarks I found out that very simple queries are taking a relatively long time; upon further inspection it appears that the actual COMMIT command is really slow.

I ran a very simple test using the following table:

CREATE TABLE test (
    id serial primary key,
    foo varchar(16)
);

After turning on logging for all statements I ran the following transaction 10000 times:

BEGIN;
INSERT INTO test (foo) VALUES ('bar');
COMMIT;
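
Not part of the original post, but for concreteness, here is one way such a loop could be driven (the database name testdb and the script file txn.sql are assumptions; with log_min_duration_statement = 0 every statement's duration appears in the PostgreSQL log):

# Sketch: replay the three-statement transaction 10000 times.
# txn.sql contains the BEGIN / INSERT / COMMIT shown above.
for i in $(seq 1 10000); do
    psql -q -d testdb -f txn.sql >/dev/null
done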

The BEGIN and the INSERT each take < 1ms to complete, but the COMMIT takes an average of 22ms.

Running the same benchmark on my own PC, which is a lot slower, yields the same averages for the BEGIN and INSERT statements, but the average COMMIT is around 0.4ms (more than 50 times faster).

After some reading I tried the pg_test_fsync tool to try to pin down the problem. On the server I get these results:

$ ./pg_test_fsync -o 1024
1024 operations per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      14.875 ops/sec
        fdatasync                          11.920 ops/sec
        fsync                              30.524 ops/sec
        fsync_writethrough                            n/a
        open_sync                          30.425 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      19.956 ops/sec
        fdatasync                          23.299 ops/sec
        fsync                              21.955 ops/sec
        fsync_writethrough                            n/a
        open_sync                           3.619 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
        16kB open_sync write                5.923 ops/sec
         8kB open_sync writes               3.120 ops/sec
         4kB open_sync writes              10.246 ops/sec
         2kB open_sync writes               1.787 ops/sec
         1kB open_sync writes               0.830 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close                34.371 ops/sec
        write, close, fsync                36.527 ops/sec

Non-Sync'ed 8kB writes:
        write                           248302.619 ops/sec

On my own PC I get:

$ ./pg_test_fsync -o 1024
1024 operations per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      69.862 ops/sec
        fdatasync                          68.871 ops/sec
        fsync                              34.593 ops/sec
        fsync_writethrough                            n/a
        open_sync                          26.595 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      26.872 ops/sec
        fdatasync                          59.056 ops/sec
        fsync                              34.031 ops/sec
        fsync_writethrough                            n/a
        open_sync                          17.284 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
        16kB open_sync write                7.412 ops/sec
         8kB open_sync writes               3.942 ops/sec
         4kB open_sync writes               8.700 ops/sec
         2kB open_sync writes               4.161 ops/sec
         1kB open_sync writes               1.492 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close                35.086 ops/sec
        write, close, fsync                34.043 ops/sec

Non-Sync'ed 8kB writes:
        write                           240544.985 ops/sec

The server's configuration:

CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
RAM: 32GB
Disk: 2x 3TB SATA disks in Software RAID 1

The machine used for comparison is an i5 with 16GB of RAM and plain SATA disks (no RAID).

More info:

  • OS: Ubuntu server 12.10
  • Kernel: Linux ... 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
  • Software RAID 1
  • Filesystem is ext4
  • No other mount options specified.
  • Postgres version 9.1
  • Linux mdadm raid

Output of dumpe2fs:

dumpe2fs 1.42.5 (29-Jul-2012)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          16e30b20-0531-4bcc-877a-818e1f5d5fb2
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              182329344
Block count:              729289039
Reserved block count:     36464451
Free blocks:              609235080
Free inodes:              182228152
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      850
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   256
RAID stride:              1
Flex block group size:    16
Filesystem created:       Sat Jan 19 12:42:19 2013
Last mount time:          Wed Jan 23 16:23:11 2013
Last write time:          Sat Jan 19 12:46:13 2013
Mount count:              8
Maximum mount count:      30
Last checked:             Sat Jan 19 12:42:19 2013
Check interval:           15552000 (6 months)
Next check after:         Thu Jul 18 13:42:19 2013
Lifetime writes:          257 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
First orphan inode:       17304375
Default directory hash:   half_md4
Directory Hash Seed:      a71fa518-7696-4a28-bd89-b21c10d4265b
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x000df5a4
Journal start:            31733

mdadm --detail output:

/dev/md2:
        Version : 1.2
  Creation Time : Sat Jan 19 12:42:05 2013
     Raid Level : raid1
     Array Size : 2917156159 (2782.02 GiB 2987.17 GB)
  Used Dev Size : 2917156159 (2782.02 GiB 2987.17 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Mar 22 11:16:45 2013
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:2
           UUID : d87b98e7:d584a4ed:5dac7907:ae5639b0
         Events : 38

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3

Update 2013-03-25: I ran a long SMART test on both disks, which revealed no problems. Both disks are from Seagate, model ST3000DM001-9YN166.

Update 2013-03-27: I ran pg_test_fsync from the latest version (9.2.3) on a completely idle machine:

$ ./pg_test_fsync -s 3
3 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      39.650 ops/sec
        fdatasync                          34.283 ops/sec
        fsync                              19.309 ops/sec
        fsync_writethrough                            n/a
        open_sync                          55.271 ops/sec

It is slightly better than before, but still deplorable. Partitions on both disks are aligned:

$ sudo parted /dev/sdb unit s print
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sdb: 5860533168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start      End          Size         File system  Name  Flags
 4      2048s      4095s        2048s                           bios_grub
 1      4096s      25169919s    25165824s                       raid
 2      25169920s  26218495s    1048576s                        raid
 3      26218496s  5860533134s  5834314639s                     raid

mount -v output:

$ mount -v | grep ^/dev/
/dev/md2 on / type ext4 (rw,noatime)
/dev/md1 on /boot type ext3 (rw)

The md2 device is being used for the tests. I'm going to destroy the swap partition and run pg_test_fsync on the individual disks.

If I run pg_test_fsync on both disks individually I get roughly the same performance. The partition was mounted with noatime:

$ pg_test_fsync/pg_test_fsync -s 3

3 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      75.111 ops/sec
        fdatasync                          71.925 ops/sec
        fsync                              37.352 ops/sec
        fsync_writethrough                            n/a
        open_sync                          33.746 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      38.204 ops/sec
        fdatasync                          49.907 ops/sec
        fsync                              32.126 ops/sec
        fsync_writethrough                            n/a
        open_sync                          13.642 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write          25.325 ops/sec
         2 *  8kB open_sync writes         12.539 ops/sec
         4 *  4kB open_sync writes          6.207 ops/sec
         8 *  2kB open_sync writes          3.098 ops/sec
        16 *  1kB open_sync writes          1.208 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close                27.275 ops/sec
        write, close, fsync                20.561 ops/sec

Non-Sync'ed 8kB writes:
        write                           562902.020 ops/sec

After running the test a couple of times, both on the array and on the single disk, the numbers seem to vary wildly. In the worst case the performance is about 50% of what I posted here (somewhere around 30 ops/sec for the first test). Is this normal? The machine is completely idle the whole time.

Also, according to the dmesg output, the controller is in AHCI mode.

Blubber
  • Can you provide some details on that software RAID? What software? Linux's `mdadm` or `dmraid`? Something vendor-specific? Something else? Your PostgreSQL version and host OS version would help too. – Craig Ringer Mar 22 '13 at 10:01

3 Answers


The server has incredibly, unspeakably, amazingly slow fsync performance. There's something very badly wrong with your software RAID 1 setup. The terrible fsync performance is almost certainly the cause of your performance problems.

The desktop merely has very slow fsync.

You can work around the performance issue, at the cost of losing some data after a crash, by setting synchronous_commit = off and setting a commit_delay. You really need to sort out the disk performance on the server, though; that's jaw-droppingly slow.
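
For illustration, a minimal sketch of that workaround in postgresql.conf (the values are placeholders to tune, not recommendations):

# Trades durability of the most recent commits for commit speed.
synchronous_commit = off    # COMMIT returns before the WAL is flushed to disk
#commit_delay = 10000       # alternative: delay the WAL flush (microseconds) so
                            # concurrent commits can share a single fsync

Note that synchronous_commit can also be set per session (SET synchronous_commit = off;), so only transactions you can afford to lose after a crash take the risk.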

For comparison, here's what I get on my laptop (i7, 8GB RAM, mid-range 128G SSD, pg_test_fsync from 9.2):

Compare file sync methods using one 8kB write:

        open_datasync                    4445.744 ops/sec
        fdatasync                        4225.793 ops/sec
        fsync                            2742.679 ops/sec
        fsync_writethrough                            n/a
        open_sync                        2907.265 ops/sec

Admittedly this SSD probably isn't hard-power-loss-safe, but it's not like a decent power-fail-safe SSD costs a great deal when we're talking server costs.

Craig Ringer
  • Yes, but what is causing the bad fsync performance? – Blubber Mar 22 '13 at 09:55
  • I tried running pg_test_fsync on my own SSD, and I get comparable performance figures. I know that I could disable sync commits, but the question remains, what is the cause of this problem? – Blubber Mar 22 '13 at 10:02
  • @Blubber What software RAID are you using? What file system? What host OS and version? What file system mount options? These are all critical information if you're seeking the root cause; please update your question. You should also do SMART health checks on the hard drives (`smartctl -d ata -a /dev/sdx|less` and `smartctl -d ata -t long /dev/sdx` followed by a `sleep 90m` or whatever `smartctl` tells you followed by another `-d ata -a` to get the results). – Craig Ringer Mar 22 '13 at 10:04
  • @Blubber Even if you fix the RAID issues your performance will still be bad, just not quite *as* bad. Plain old 7200RPM (or, worse, 5400RPM) disks are a *bad* choice for database performance, especially without a proper hardware RAID controller with battery-backed cache that lets the controller group up and buffer writes. – Craig Ringer Mar 22 '13 at 10:12
  • I have updated the question with more details about the filesystem and raid config. I understand that this machine is never going to give very good performance in its current configuration. But the current performance is _really_ bad. – Blubber Mar 22 '13 at 10:18
  • @Blubber OK, that's sane-ish. What about the `SMART` tests? Also check `dmesg` for errors pertaining to disks and I/O. If one of the disks is failing or has bad sectors that'd more than explain the performance. You should also check that you're not using 4k sector disks with misaligned partitions. Please specify the **exact disk models** and show the output of `fdisk -l /dev/sda` and `fdisk -l /dev/sdb`. If you don't know the disk model you can get it from a `smartctl` query of the disk. – Craig Ringer Mar 22 '13 at 11:22
  • The short SMART test on both disks revealed no problems; I started long tests on both and will check the results ASAP. dmesg and the syslog showed no errors. I checked the partition alignment on Friday; the partitions are aligned on 4k boundaries. Also, the disks use GPT, so fdisk -l doesn't show useful output. – Blubber Mar 25 '13 at 06:01
  • @Blubber Thanks, doesn't matter about fdisk so long as you've confirmed they're 4k aligned. Very odd. Done any throughput tests on the individual disks in the array? Looked at the system `iostat` and `vmstat` to see if there's other activity that might be hammering the disks? Tried examining `blkstat` output on the `md` and the other partitions? Again, also, please confirm the exact disk models. – Craig Ringer Mar 25 '13 at 06:36
  • I have amended my original question with information about the disk types. In the meantime I tried setting noatime and killed all other processes (when possible) on the machine. This increases the performance two-fold, from extremely bad, to just really bad. – Blubber Mar 25 '13 at 07:30
  • @Blubber Have you updated their firmwares? (http://knowledge.seagate.com/articles/en_US/FAQ/223651en) – Craig Ringer Mar 25 '13 at 07:54
  • Just checked, the firmware on both disks is at the latest version (CC4H). – Blubber Mar 25 '13 at 08:18
  • @Blubber And the rest - blkstat, iostat, etc? This kind of thing usually doesn't have easy answers, you need to slowly trace and isolate the problem. Things like benchmarking the individual disks separately, then the md0 array on top can help; `bonnie++` can be good for this, as can `pg_test_fsync` on partitions created directly on the drive without the `md` layer. – Craig Ringer Mar 26 '13 at 05:57
  • +1. When I saw the IOPS numbers I thought "cheap totally overloaded VPS". This is beyond comical, even for the slow SATA discs. Good catch - the numbers are just ridiculous, something is VERY wrong here. – TomTom Mar 26 '13 at 14:19

This is the pg_test_fsync output on my server, with a very similar configuration: Linux software RAID 1 on two consumer-grade disks (WD10EZEX-00RKKA0):

# ./pg_test_fsync -s 3
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                     115.375 ops/sec
        fdatasync                         109.369 ops/sec
        fsync                              27.081 ops/sec
        fsync_writethrough                            n/a
        open_sync                         112.042 ops/sec
...

You did test this on a completely idle server, didn't you?


Maybe you have unaligned partitions. Check:

parted /dev/sda unit s print

This is the output of my server:

Model: ATA WDC WD10EZEX-00R (scsi)
Disk /dev/sda: 1953525168s
Sector size (logical/physical): 512B/4096B
Partition Table: msdos

Number  Start       End          Size         Type     File system     Flags
 1      2048s       67110911s    67108864s    primary  ext4            boot, raid
 2      67110912s   603981823s   536870912s   primary                  raid
 3      603981824s  608176127s   4194304s     primary  linux-swap(v1)
 4      608176128s  1953523711s  1345347584s  primary                  raid

Check that each number in the Start column is divisible by 2048 (which means 1 MiB with 512-byte sectors). For good 4096 B alignment, divisibility by 8 would suffice, but alignment-aware utilities start partitions at 1 MiB boundaries.
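
As an aside (not from the original answer), a rough way to automate that check, assuming 512-byte logical sectors and GNU parted's output format:

# Print any partition whose start sector is not a multiple of 2048 (1 MiB).
sudo parted /dev/sda unit s print | awk '/^ *[0-9]+ +[0-9]+s/ { s = $2; sub(/s$/, "", s); if (s % 2048) print "misaligned:", $0 }'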


Also, maybe you're using non-default mount options, like data=journal, which have a big impact on performance. Show the output of mount -v | grep ^/dev/. This is mine:

/dev/md0 on / type ext4 (rw,barrier,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0)
/dev/md2 on /home type ext4 (rw,barrier,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0)
/dev/md1 on /var type ext4 (rw,barrier,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv0)

Maybe one of your disks is broken. Create one partition on each disk with no RAID (you may have reserved swap partitions on both disks; use those, as there's no use for RAID on swap anyway). Create filesystems there and run pg_test_fsync on each drive; if one disk has problems, the good one would have to wait for it when both are mirrored. A sketch of this follows.
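
A hypothetical walk-through (the device name and mount point are placeholders; adjust to your layout):

# Put a filesystem on one disk's unused swap partition and benchmark
# it outside the md array; repeat for the other disk.
sudo swapoff /dev/sda2                    # only if the partition is active swap
sudo mkfs.ext4 /dev/sda2
sudo mkdir -p /mnt/disktest
sudo mount -o noatime /dev/sda2 /mnt/disktest
cd /mnt/disktest && sudo pg_test_fsync -s 3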


Check that your BIOS is set to use AHCI mode instead of IDE mode. A server would benefit from Native Command Queuing, which is not available in IDE mode.
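
Two quick ways to verify this from a running system (not from the original answer; sda is a placeholder):

# The ahci driver should claim the controller, and a queue depth
# greater than 1 suggests NCQ is actually in use.
dmesg | grep -i ahci
cat /sys/block/sda/device/queue_depth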


Ignore the comparison with the SSD. It's ridiculous to compare.

Tometzky
  • I ran bonnie++, which showed good performance (as good as you'd expect from regular SATA disks). Also, the partitions are aligned. When I ran pg_test_fsync the first time it was on a VM. Then I ran it on the actual hardware, after shutting down every other process (including VMs). The performance was slightly better, around 40 ops/sec, which is still deplorable. I'm going to run some more tests, on separate partitions, if I have time today. Thanks for all the suggestions. – Blubber Mar 27 '13 at 07:37
  • I have amended my original question with additional information about mount options and partition alignment. – Blubber Mar 27 '13 at 07:52

I know I might be way too late in answering this. I have been seeing similar performance problems with PostgreSQL and MySQL when using O_DIRECT. I micro-benchmarked the system using iozone with sync writes (the -+r option) and with and without O_DIRECT (the -I option). Below are the two commands I used:

iozone -s 2g -r 512k -+r -I -f /mnt/local/iozone_test_file -i 0

and

iozone -s 2g -r 512k -+r    -f /mnt/local/iozone_test_file -i 0

The first is O_SYNC + O_DIRECT while the second is only O_SYNC. With the first I was getting roughly 30MB/sec and with the second I was getting about 220MB/sec (SSD drive). What I found out was that the has_journal option on ext4 seems to cause the problem. I don't really know why...

Filesystem features:      has_journal 

Once I removed this option, things started working fine, with both tests reaching and sustaining 220MB/sec. To remove the option, you can use something like:

tune2fs -O ^has_journal /dev/sdX
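
A caveat not in the original answer, per the tune2fs documentation: the has_journal feature can only be cleared when the filesystem is unmounted (or mounted read-only), and checking the filesystem afterwards is advisable:

umount /dev/sdX
tune2fs -O ^has_journal /dev/sdX
e2fsck -f /dev/sdX    # verify the filesystem after the feature change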

You can give that a try and see if it solves your performance issues.

scribul
  • Disabling the journal on ext3/4 is not something to be done without careful consideration and a very good understanding of the impact. – ThatGraemeGuy Sep 18 '13 at 13:20
  • I disagree. A DBMS does its own logging and recovery to provide durability and atomicity of transactions. The FS journal is completely useless in that regard. As long as fsync works properly, the effects of committed transactions can always be restored. – Caetano Sauer Oct 06 '14 at 09:42