I have a pair of QNAP TS859U-RP NAS devices running Linux version 2.6.33.2. These boxes are based on Debian. Both are experiencing the same failure.

When I try to mount an NFS share as root as follows:

mount nasb:/content /mnt/nasb/blah/  

The mount times out on the client:

mount: mount to NFS server 'nasb' failed: timed out (retrying).  
mount: mount to NFS server 'nasb' failed: timed out (retrying). 

An error like the following is also written to logs/kmsg on the NAS:

<6>rpc.mountd[522]: segfault at b73aa078 ip b76fdf56 sp bf8b03a4 error 7 in libuLinux_statistics.so.0.0[b76fd000+3000]  
<6>rpc.mountd[6539]: segfault at b7418078 ip b776bf56 sp bfd69a94 error 7 in libuLinux_statistics.so.0.0[b776b000+3000]  
... repeating
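One thing that can be read straight off those two log lines: subtracting the mapping base (the number before `+3000`) from the `ip` value gives the offset of the faulting instruction inside libuLinux_statistics.so. A minimal sketch of that arithmetic follows; the helper name `segv_offset` is made up for illustration. On x86, `error 7` decodes as a user-mode write to a mapped page that does not allow it.

```shell
#!/bin/sh
# Offset of the faulting instruction inside the library = ip - mapping base.
# segv_offset is a hypothetical helper; both arguments are hex without "0x".
segv_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

segv_offset b76fdf56 b76fd000   # first log line  -> 0xf56
segv_offset b776bf56 b776b000   # second log line -> 0xf56
```

Both crashes land on the same offset, which suggests the same instruction in the library fails every time rather than some random memory corruption. If binutils happens to be installed, that offset is what you would look up in a disassembly of the library.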

rpc.mountd is restarted, and the device momentarily stops serving NFS requests to all hosts. The very same share is mounted and working on 23 other hosts, as confirmed by running netstat -a | grep nfs | grep ESTAB | wc -l on the NAS. I know there is some soft limit on the number of hosts that can mount a particular NFS share, but even if I have hit it, rpc.mountd should return an error rather than segfault.
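For what it's worth, the netstat pipeline counts connections, not hosts, so a client with more than one connection is counted more than once. A rough sketch that counts distinct client addresses instead, assuming NFS over TCP on port 2049, IPv4 addresses, and the usual Linux `netstat -tn` column layout (the function name `count_nfs_clients` is made up):

```shell
#!/bin/sh
# Count distinct remote hosts with an established TCP connection to port 2049.
# Reads `netstat -tn` style output on stdin; assumes IPv4 "addr:port" columns.
count_nfs_clients() {
    awk '$6 == "ESTABLISHED" && $4 ~ /:2049$/ {
             sub(/:[0-9]+$/, "", $5)   # strip the client port
             print $5
         }' | sort -u | wc -l
}

# Usage on the NAS:
#   netstat -tn | count_nfs_clients
```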

This is a software RAID 6 volume formatted with ext4. I have contacted support a few times but have gotten nothing beyond the standard "upgrade/reapply your firmware and reboot" response. No satisfaction :( Any ideas on tracking down the error with rpc.mountd/libuLinux_statistics.so would be greatly appreciated.

Thank you in advance.

allexiusw
user121843
  • I'd start by strace-ing rpc.mountd to see which syscall it's failing on. Alternatively, if you can afford the downtime for the other clients, you might try attaching a debugger: run "gdb -p <pid>", then hit "c" to continue. Then trigger the error. You should see the segfault happen (which causes the debugger to stop the process), and you can then run bt to see a backtrace. – malcolmpdx Mar 29 '11 at 03:25
  • Those are excellent suggestions. I will see about getting some debugging tools installed using ipkg on this nas. – user121843 Mar 29 '11 at 14:25
  • Thx again for the suggestions @malcolmpdx. I pasted the output of strace [here](http://pastebin.com/GKvVV1tQ). I don't see anything obvious right before the segfault but that could be due to my lack of interpretation skills. Much respect to anyone that can take a look at that and give some insight. – user121843 Mar 29 '11 at 15:03
  • @brandon - the call before the segfault is a shared memory function. What's your memory usage on the server look like? It's possible that's why many clients work, and then these two don't - lack of memory available. Easy test would be to turn off a few of the other clients, and then see if these two can connect. – malcolmpdx Mar 29 '11 at 15:11
  • These devices each have 1 GB of RAM installed, with about 20% of that free. I can certainly try your suggestion during an off-peak hour to either support or rule out that theory. Good thinking. – user121843 Mar 29 '11 at 15:27
  • @brandon - upon further consideration, that shared memory call actually returned with a good value...likely a red herring. So, really, I'd get GDB involved here. – malcolmpdx Mar 29 '11 at 16:54
  • Thank you, @malcolmpdx. I think I have it narrowed down to something going south during the firmware upgrade, which failed to update one file: /usr/lib/libuLinux_statistics.so. I reapplied the firmware update, the file was updated, and the problem has yet to manifest itself again. I couldn't have done it (at least not in a reasonable time) without your suggestions. – user121843 Mar 30 '11 at 13:38
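
For anyone landing here later, the strace and gdb suggestions from the comments might look roughly like the sketch below. It assumes strace, gdb and pidof are available on the NAS (they may need installing via ipkg), and the helper name `mountd_debug_cmds` and the log path are made up. The function only prints the commands so they can be reviewed first. Run one or the other, not both at once, since only one tracer can attach to a process at a time.

```shell
#!/bin/sh
# Sketch only: prints the debugging commands instead of running them.
# Assumes strace and gdb are installed; /tmp/mountd.strace is an arbitrary path.
mountd_debug_cmds() {
    pid=$1
    # Log syscalls with timestamps, following forks, until the process dies.
    echo "strace -f -tt -o /tmp/mountd.strace -p $pid"
    # Attach, let the daemon continue, and print a backtrace once it stops on SIGSEGV.
    echo "gdb -p $pid -batch -ex continue -ex bt"
}

# Usage on the NAS (look the PID up fresh each time):
#   mountd_debug_cmds "$(pidof rpc.mountd)"
```

Note that rpc.mountd comes back with a new PID after every crash (522 and 6539 in the log above), so the tracer has to be re-attached before each attempt.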

1 Answer

Thank you, @malcolmpdx. I think I have it narrowed down to something going south during the firmware upgrade, which failed to update one file: /usr/lib/libuLinux_statistics.so. I reapplied the firmware update, the file was updated, and the problem has yet to manifest itself again. I couldn't have done it (at least not in a reasonable time) without your suggestions.
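
If you want to confirm that a firmware reapply really replaced the library, one option is to keep a copy from before and compare checksums afterwards. This is only a sketch, assuming md5sum is present on the device; the helper names and the /tmp snapshot path are made up.

```shell
#!/bin/sh
# Sketch: detect whether a file's contents changed across a firmware reapply.
# Assumes md5sum is available; helper names are hypothetical.
file_sum() {
    md5sum "$1" | awk '{print $1}'
}

# Exit status 0 when the two files differ, non-zero when they are identical.
file_changed() {
    [ "$(file_sum "$1")" != "$(file_sum "$2")" ]
}

# Usage:
#   cp /usr/lib/libuLinux_statistics.so /tmp/stats.so.before    # before reapplying
#   file_changed /tmp/stats.so.before /usr/lib/libuLinux_statistics.so \
#       && echo "library was replaced"
```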

user121843