Under high I/O, DRBD crashes and takes the server down. Is there any way to tune DRBD to prevent this from happening again? My current config, the errors, and the server specs are listed below. If you need any more information, please let me know. Thanks in advance.
Latest DRBD config (the secondary uses the same settings):
[root@23 ~]# cat /etc/drbd.d/drbd0.res
resource drbd0 {
    startup {
        degr-wfc-timeout 30; # default is 2 minutes.
    }
    disk {
        on-io-error detach;
        fencing dont-care;
        disk-barrier no;
        disk-flushes no;
        al-extents 3389;
    }
    net {
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 512k;
        unplug-watermark 16;
        after-sb-1pri discard-secondary;
    }
    on 23 {
        device /dev/drbd0;
        disk /dev/sdb1;
        address 10.251.30.148:7789;
        flexible-meta-disk internal;
    }
    on 23-t2 {
        device /dev/drbd0;
        disk /dev/sdb1;
        address 10.48.25.66:7789;
        flexible-meta-disk internal;
    }
}
Error after crash:
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task drbd_w_drbd1:2412 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task master:2506 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task java:2653 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task jbd2/drbd1-8:2234 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2380 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2396 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2409 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
INFO: task cdpserver:2416 blocked for more that 120 seconds
"echo 0 > proc/sys/kernel/hung_task_timeout_secs" disables this message
BUG: soft lockup - CPU#10 stuck for 67s! [scsi_eh_6:616]
BUG: soft lockup - CPU#10 stuck for 67s! [scsi_eh_6:616]
aacraid: acc_fib_send: first asynshronous command timed out
Usually a result of a PCI interrup routing problem"
update mother board BIOS or consider utilizing one of
the SAFE mode kernel option (acpi, apic etc)
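To make it easier to see which tasks are stuck, here is a rough shell sketch that tallies the task names from the hung-task messages. The function name blocked_tasks is made up for illustration, and it assumes the lines have the "INFO: task NAME:PID blocked for more ..." form shown above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print a count
# of hung-task messages per task name, most frequent first.
blocked_tasks() {
    # Extract NAME from "INFO: task NAME:PID blocked for more ...".
    # Lines that do not match are dropped (-n plus the p flag).
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more .*/\1/p' |
        sort | uniq -c | sort -rn
}
```

It can be fed from the kernel ring buffer, for example: dmesg | blocked_tasks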
Current setup:
CentOS release 6.3
2.6.32-279.5.2.el6.x86_64
drbd-8.4.1-1.el6.x86_64
2 x Intel Xeon E5620
12 GB of memory
Adaptec 5805
/dev/drbd0 15T
/dev/drbd1 15T