While we were replacing a disk in a zpool on a FreeBSD 10.3-RELEASE-p20 system yesterday, the ZFS filesystems became unresponsive after we issued the zpool detach srv gpt/d0 command. The server acts as an NFS server, WebDAV server and iSCSI target, and once zpool detach had been executed all iSCSI clients started experiencing timeouts.

This apparently caused the entire ZFS subsystem to lock up. zpool status or any other command would just hang and produce no output. There was nothing showing in dmesg, and top didn't show any processes consuming a large amount of CPU. In the end we were unable to find a solution and were forced to reboot the system to get the iSCSI targets back online. Even that required a hard reboot, because a soft restart failed to complete after all services had been stopped.

What causes this situation and how can we avoid it? How can we prevent zpool detach from hanging when replacing a device in a ZFS pool under FreeBSD?

Josh

2 Answers

1

I'm unsure why this happens, but we found that the issue was related to having the ZFS autoexpand property enabled on the pool. Setting autoexpand=off using:

zpool set autoexpand=off srv

allowed us to detach and replace further devices without zpool detach hanging in the same way.
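For anyone who wants to script the workaround, here is a minimal sketch that checks the property first and only then detaches. The function name detach_with_autoexpand_off is hypothetical, and I am assuming that zpool get on your release accepts the -H and -o value options (check zpool(8) before relying on it). It only reflects what worked for us and is not a confirmed fix.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not part of ZFS):
# turn autoexpand off on a pool if it is on, then detach a device.
# Assumes `zpool get -H -o value` is supported by the installed zpool(8).
detach_with_autoexpand_off() {
    pool="$1"
    device="$2"

    # Read the current value of the autoexpand property ("on" or "off").
    current=$(zpool get -H -o value autoexpand "$pool") || return 1

    if [ "$current" = "on" ]; then
        echo "autoexpand is on for $pool; turning it off before detach"
        zpool set autoexpand=off "$pool" || return 1
    fi

    # Detach only after autoexpand is known to be off.
    zpool detach "$pool" "$device"
}

# Example call, using the pool and device from the question:
#   detach_with_autoexpand_off srv gpt/d0
```

The function does not turn autoexpand back on afterwards. If you rely on it, re-enable it with zpool set autoexpand=on srv once the replacement has finished.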

I'm still interested in understanding more about this failure mode, but I'm answering my own question in the hope of sharing the knowledge that disabling autoexpand can resolve this issue.

Josh
1

Looks like this was fixed in 11.0-RELEASE: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216881

I guess they didn't deem it worthy of a backport to 10.3-RELEASE-p22. :(