
I have 4 NetApp 2240-4 filer heads. They're single-chassis 'cluster in a box' configurations, so two separate units.

Over the last few days, at about the same time - all of them started logging a LOT of Low water mark consistency points.

Running `wafl_susp -w` gives me `cp_from_low_water` clocking up at a rate of 10/sec or more. Before this started, they were almost entirely `cp_from_timer` at a rate of 1 every 10s or so.

Two of my boxes have become unresponsive and have been rebooted, and the problem has now gone away. I'm not 100% sure that's connected, but it seems a reasonable bet as the culprit.

The other two are completely idle - they have a base OS and a couple of vfilers, and nothing else. Yet the low water mark CPs suggest they're running out of memory for some reason. I can only assume some sort of denial-of-service condition is occurring (perhaps failed SSH logins?).

Can anyone offer any insight into how to troubleshoot this? Specifically, from a NetApp perspective, I'm looking for hints on how to work out what's hogging my memory.

Sobrique

2 Answers


Open a ticket - this is an indication that the system is short of memory, and if there's little work being done and you still had boxes go unresponsive, there's something screwy happening. I've walked through the process of inspecting internal memory usage before with support on the line, but it's not something clients are supposed to do on their own. You'd need to use a `priv set` command and check running processes.

Basil
  • I'm setting things in motion with NetApp, I know that's the right port of call. I'm just also quite keen on the notion of rolling up my sleeves to troubleshoot. `wafl_susp -w` requires `diag`. – Sobrique Jan 08 '15 at 15:54
  • Yes, you'll probably end up setting up a diag elevated session and running something like `mem_stats`, but it's not well documented publicly and, as far as I know, can actually cause trouble if not used properly. Once you get a call back, don't let them remote into the box, insist that they walk you through it. And then take notes and ask questions :) – Basil Jan 08 '15 at 15:56
  • Obviously logged this at too low a priority, because the boxes have become unresponsive in the meantime, and I've had to power cycle. – Sobrique Jan 12 '15 at 11:20
  • Is this in production yet? If so, there's a special option on the netapp support line that may put you through faster. – Basil Jan 12 '15 at 18:10
  • Nope, but the problem's gone now. I'll hold off until it reoccurs, and then nag 'em. – Sobrique Jan 12 '15 at 18:21
  • Looks like we found a mem leak bug, in LDAP+SASL – Sobrique Feb 23 '15 at 14:22
  • Lucky you! What firmware level? I run those two as well... – Basil Feb 24 '15 at 15:28
  • 8.1.3 - fixed in 8.2.3 (my 8.0.2 controllers aren't showing it). Bug ID 697790 - from my monitoring, it looks like it's about 32 bytes per failed LDAP auth. On low-memory controllers, a locked account is enough to knock them over in 4-6 weeks. My 6280s are still 'fine' (maybe - I might be having perf problems); after 700 days we're at 6G of memory. – Sobrique Feb 24 '15 at 15:37

Case opened with vendor regarding problem.

Low Water Mark CPs are the result of memory exhaustion: (Vendor link)

> CP caused by low water mark; the amount of memory available for routine housekeeping tasks is low enough that it is ideal to start a CP to release some more

To work with the vendor, we ran a 'perfstat' - a downloadable NetApp tool for gathering and submitting performance-related support information. This led us to bug ID 697790 (support login required), which is present on the version of code we were running and fixed in ONTAP 8.2.3.

Specifically, it's a memory leak that occurs when LDAP authentication fails. Because all 4 hosts were using the same account, and because at some point the lockout had tripped, they were all failing absurdly frequently. (And they were very low-memory systems in the first place.)
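To put some rough numbers on it: the ~32 bytes per failed auth figure comes from my own monitoring (see the comments above), but the failure rate and controller memory in this sketch are illustrative assumptions, not measured values.

```python
# Back-of-envelope leak estimate. Only the bytes-per-failed-auth figure is
# observed; the retry rate and memory size below are hypothetical.

BYTES_PER_FAILED_AUTH = 32          # approx, from monitoring
FAILED_AUTHS_PER_SEC = 10           # hypothetical retry rate against a locked account
CONTROLLER_MEMORY_GIB = 6           # hypothetical low-memory controller

leak_per_day = BYTES_PER_FAILED_AUTH * FAILED_AUTHS_PER_SEC * 86400   # bytes/day
days_to_5_percent = (CONTROLLER_MEMORY_GIB * 2**30 * 0.05) / leak_per_day

print(f"Leak rate: {leak_per_day / 2**20:.1f} MiB/day")
print(f"Days to reach 5% of memory: {days_to_5_percent:.0f}")
```

At those assumed rates you'd chew through 5% of a small controller's memory in under a couple of weeks, which is broadly consistent with boxes falling over after 4-6 weeks of a locked account.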

I have looked at other systems where this bug is present, and there are some signs of it happening, but even on systems with 700+ days of uptime only an insignificant amount of leakage had occurred.

In general (with the caveat that 'diag' commands are potentially dangerous, so should be used with extreme caution and ideally not without talking to the vendor), we could identify the problem by looking at `mem_stat` - the second column is bytes; look for 'sasl':

    1306719 5268691008 maytag.ko::sasl_client_new+149

I don't know at what level the problem crops up - I'm waiting for the systems to crash again to check - but I'd suggest that at over 5% memory utilisation you should be considering taking action. A reboot fixes it, as does a code update.
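If you'd rather check this programmatically than eyeball the output, a minimal sketch along these lines would do it - assuming you've saved the `mem_stat` output to a text file. The file name, the column layout and the 6 GiB memory figure are assumptions to adjust for your own systems.

```python
# Sketch: sum sasl-related allocations from captured mem_stat output and
# compare against a memory threshold. File name and memory size are assumed.

TOTAL_MEMORY_BYTES = 6 * 2**30      # hypothetical controller memory
THRESHOLD = 0.05                    # 5% of memory, per the suggestion above

sasl_bytes = 0
with open("mem_stat_output.txt") as fh:
    for line in fh:
        fields = line.split()
        # expected layout: <allocations> <bytes> <symbol>, e.g.
        # 1306719 5268691008 maytag.ko::sasl_client_new+149
        if len(fields) >= 3 and "sasl" in fields[2]:
            sasl_bytes += int(fields[1])

pct = sasl_bytes / TOTAL_MEMORY_BYTES
print(f"SASL allocations: {sasl_bytes / 2**20:.1f} MiB ({pct:.1%} of memory)")
if pct > THRESHOLD:
    print("Over threshold - consider scheduling a reboot or the 8.2.3 update")
```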

I'm now capturing CP types and memory footprint as part of my monitoring regime, so I can observe it occurring. I'm also being a bit more proactive about spotting LDAP account lockouts.
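For what it's worth, the sort of thing I mean is a rough poll of the counter over SSH. The hostname, the ability to run `priv set -q diag; wafl_susp -w` as a single remote command, and the exact way the counter appears in the output are all assumptions to verify against your own environment before relying on it.

```python
# Rough polling sketch for trending cp_from_low_water over time.
# Assumes password-less SSH to the filer and that the counter appears in the
# wafl_susp -w output as a number next to its name - check both on your system.

import re
import subprocess
import time

FILER = "filer01"                   # hypothetical hostname
INTERVAL = 300                      # seconds between samples

while True:
    out = subprocess.run(
        ["ssh", FILER, "priv set -q diag; wafl_susp -w"],
        capture_output=True, text=True, check=False,
    ).stdout
    match = re.search(r"cp_from_low_water\D*(\d+)", out)
    value = match.group(1) if match else "not found"
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} cp_from_low_water={value}")
    time.sleep(INTERVAL)
```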

Sobrique