
Under high I/O, DRBD will crash and take the server down. Is there any way to tune DRBD to prevent this from happening again? Listed below are my current config, the errors, and the hardware specs. If you need any more information, please let me know. Thanks in advance.

Latest DRBD config (the secondary uses the same settings):

[root@23 ~]# cat /etc/drbd.d/drbd0.res
resource drbd0 {
  startup {
    degr-wfc-timeout 30;    # default is 2 minutes
  }
  disk {
    on-io-error    detach;
    fencing        dont-care;
    disk-barrier   no;
    disk-flushes   no;
    al-extents     3389;
  }
  net {
    max-buffers       8000;
    max-epoch-size    8000;
    sndbuf-size       512k;
    unplug-watermark  16;
    after-sb-1pri     discard-secondary;
  }

  on 23 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.251.30.148:7789;
    flexible-meta-disk  internal;
  }

  on 23-t2 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.48.25.66:7789;
    flexible-meta-disk  internal;
  }
}
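
For completeness, this is roughly how I check and apply the settings after editing (a sketch using the resource name drbd0 from the config above, run on each node):

[root@23 ~]# drbdadm dump drbd0      # parse the config and print what DRBD will actually use
[root@23 ~]# drbdadm adjust drbd0    # apply the changed settings to the running resource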

Errors after the crash:

"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task drbd_w_drbd1:2412 blocked for more that 120 seconds 
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task master:2506 blocked for more that 120 seconds 
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task java:2653 blocked for more that 120 seconds 
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task jbd2/drbd1-8:2234 blocked for more that 120 seconds 
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2380 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2396 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2409 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2416 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
BUG: soft lockup - CPU#10 stuck for 67s! [scsi_eh_6:616]
BUG: soft lockup - CPU#10 stuck for 67s! [scsi_eh_6:616]
aacraid: acc_fib_send: first asynshronous command timed out 
Usually a result of a PCI interrup routing problem"
update mother board BIOS or consider utilizing one of
the SAFE mode kernel option (acpi, apic etc)

Current setup:

CentOS release 6.3
2.6.32-279.5.2.el6.x86_64
drbd-8.4.1-1.el6.x86_64
2x Xeon E5620
12GB of memory
Adaptec 5805
/dev/drbd0             15T
/dev/drbd1             15T
    You are not providing any useful information to go on. I have never seen DRBD crash, and we've been using it in production for many years now. Are you even sure there's a crash and not simply some form of resource-level fencing going on? Please provide any relevant logfiles, configuration files of your DRBD setup, and, if applicable, your cluster manager, as well as the exact circumstances of the alleged crashing. Does your DRBD setup crash on *both* machines simultaneously? What does "under load" mean? IO? Network? CPU? All of the above? – daff Oct 02 '12 at 22:16

1 Answer


You still haven't explained what "crashing" means in this context. In your "after crash" messages it certainly looks like DRBD is still running. What does cat /proc/drbd say after the event? What about the output of ps -ef | grep -i [d]rbd?
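
For example, something along these lines right after the event (a sketch; the exact output will differ, but the connection state cs:, roles ro: and disk states ds: are what you want to look at):

cat /proc/drbd               # is the resource still there, e.g. cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate?
ps -ef | grep -i [d]rbd      # are the DRBD worker/receiver/asender kernel threads still running?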

Anyway, to me it looks like your disks and/or storage controller are not performing well enough to sustain a high I/O load, which makes the system, and especially DRBD, wait too long while flushing writes to disk. If that is the case, then this is a problem with your hardware setup and not with DRBD. But to be certain you might want to take this up on the DRBD mailing list.
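
If you want to see whether the backing storage is indeed the bottleneck while the load is on, a quick sanity check could look like this (a rough sketch; it assumes the sysstat package is installed and that /dev/sdb is the backing device, as in your config):

iostat -x sdb 5      # %util pinned near 100 with large await/avgqu-sz means the disk can't keep up
vmstat 5             # a high 'wa' column and many processes in 'b' (blocked) point the same way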

daff