We have a Fibre Channel SAN managed by two OpenSolaris 2009.06 NFS servers.
- Server 1 is managing 3 small volumes (300GB 15K RPM drives). It's working like a charm.
- Server 2 is managing 1 large RAID6 volume of 32 drives (2TB 7200 RPM drives). Total size is 50TB.
- Both servers are at zpool version 14 and ZFS version 3.
The slow 50TB server was installed a few months ago and was working fine. Users filled up 2TB. As a small experiment, I created 1000 filesystems and took 24 snapshots on each. Everything went well as far as creating and accessing the filesystems with their snapshots, and NFS-mounting a few of them.
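For reference, the experiment was done with a shell loop along these lines (an illustrative sketch; the fsN and snapN names are made up, only the counts match what I actually did):
i=1
while [ $i -le 1000 ]; do
  pfexec zfs create bigdata/fs$i          # 1000 filesystems under the pool
  i=$((i+1))
done
s=1
while [ $s -le 24 ]; do
  pfexec zfs snapshot -r bigdata@snap$s   # recursive snapshot -> 24 snapshots per filesystem
  s=$((s+1))
done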
When I tried destroying the 1000 filesystems, the first one took several minutes and then failed, reporting that the filesystem was in use. I issued a system shutdown, but it took more than 10 minutes. I didn't wait any longer and cut the power.
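What I ran was essentially the reverse loop (again illustrative; it never got past the first filesystem, and -r also removes the 24 snapshots under each one):
i=1
while [ $i -le 1000 ]; do
  pfexec zfs destroy -r bigdata/fs$i
  i=$((i+1))
done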
Now OpenSolaris hangs on boot. The lights on the 32 drives are blinking rapidly. I left it for 24 hours - still blinking, but no progress.
I booted into a system snapshot taken before the zpool was created and tried importing the zpool:
pfexec zpool import bigdata
Same situation: LEDs blinking and the import hangs forever.
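For reference, the obvious variations of the import command (these are standard zpool options, not something specific to this pool):
pfexec zpool import              # with no pool name: only scans and lists importable pools, imports nothing
pfexec zpool import -f bigdata   # force the import if the pool is flagged as in use by another system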
DTracing the "zpool import" process shows only the ioctl system call:
dtrace -n 'syscall:::entry /pid == 31337/ { @syscalls[probefunc] = count(); }'
ioctl 2499
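To see where the time actually goes, the same approach can be extended to sum the time spent inside ioctl (a sketch; 31337 again stands in for the real zpool import pid):
pfexec dtrace -n '
  syscall::ioctl:entry  /pid == 31337/ { self->ts = timestamp; }
  syscall::ioctl:return /self->ts/     { @["total ns in ioctl"] = sum(timestamp - self->ts); self->ts = 0; }
'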
Is there a way to fix this?
Edit: Yes. Upgrading OpenSolaris to snv_134b did the trick:
pkg publisher # shows opensolaris.org
beadm create opensolaris-updated-on-2010-12-17
beadm mount opensolaris-updated-on-2010-12-17 /mnt
pkg -R /mnt image-update
beadm unmount opensolaris-updated-on-2010-12-17
beadm activate opensolaris-updated-on-2010-12-17
init 6
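After the reboot, the new build can be confirmed and the pool imported again (standard commands, nothing specific to this setup):
uname -v                     # 2009.06 reported snv_111b; this should now show the new build
pfexec zpool import bigdata
zpool status bigdata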
Now I still have ZFS version 3, the bigdata zpool stays at version 14, and it's back in production!
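The versions can be double-checked like this (zpool upgrade and zfs upgrade would raise them, which I have not run):
zpool get version bigdata    # still 14
zfs get version bigdata      # still 3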
But what was it doing with all that heavy I/O for more than 24 hours (before the software upgrade)?