We have recently implemented high availability for our postgres (9.0.4) DB server, through the methods described as Log-Shipping Standby Servers in the Postgres documentation. Everything seems to be fine and working, the WAL files are shipping and are being ingested by the standby server, but we are experience lagging between the master and slave machines. The lag is of about 2 hours which is not really acceptable.
What could be the reason for this lag? The machine is not running anything else but the postgres server, although it does use slower hard drives compared to the production server. How can I check if disk I/O is causing issues?
If I check what processes are running on the server I see a constant battle between the postgres startup process which is recovering newest WAL files and the pg_standby utility which is ingesting the archived WALs step-by-step. Is it OK that the startup process is running constantly?
ps example:
postgres 1422 0.0 1.0 13061220 131568 ? S Sep20 0:01 /usr/pgsql-9.0/bin/postmaster -p 5433 -D /data/pgsql_5433/data
postgres 1431 0.0 0.0 176928 512 ? Ss Sep20 0:12 postgres: logger process
postgres 1432 70.5 72.0 13068604 8775544 ? Ss Sep20 5744:15 postgres: startup process waiting for 000000010000181F00000016
postgres 1437 0.2 70.4 13068336 8582736 ? Ss Sep20 22:50 postgres: writer process
postgres 32199 0.0 0.0 4064 484 ? S 01:46 0:00 /usr/pgsql-9.0/bin/pg_standby -l -t/data/pgsql_5433/trigger /data/pgsql_5433/psql_wal_import 000000010000181F00000016 pg_xlog/RECOVERYXLOG 000000010000181E00000051
I would appreciate any hint ...