
Edit: The issue is resolved. The queues in question were used for flow control packets. Why the igb driver propagates FC packets up the stack only to have them dropped (and counted) is another question. But the bottom line is that nothing is dropped in a way that loses data.
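For anyone hitting the same symptom: the pause-frame situation can be inspected with ethtool. A quick sketch (eth12 is just an example name; the exact flow-control counter names can vary between driver versions):

```shell
# Show the negotiated pause/flow-control settings of an interface
ethtool -a eth12

# The igb driver exposes flow-control counters in its statistics;
# in our case these were exactly the "dropped" packets
ethtool -S eth12 | grep -i flow_control
```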

Thank you very much, syneticon-dj, your pointer to dropwatch was gold!

=== original question for further reference ===

we have the following situation:

System: The server in question is a Dell PowerEdge with 4 quad-core Xeon CPUs and 128 GB of ECC RAM, running Debian Linux with kernel 3.2.26.
The interfaces in question are dedicated iSCSI cards with four ports each, based on the Intel 82576 Gigabit Ethernet Controller.

Background: A lot of NAS filers (Thecus N5200 and Thecus XXX) are connected to one of our servers via iSCSI on dedicated 1 Gbit/s interfaces. We have 5 cards with 4 ports each. The NAS filers are connected directly, with no switch in between.

Two weeks ago we managed to clear four NAS filers and used them to build a RAID6 with mdadm. Combined with LVM, this allows us to dynamically create, shrink and/or grow storage for our various projects instead of searching all our NAS filers for free space every now and then.

However, we got a lot of overruns on pretty much every interface, and a lot of packets were dropped. Investigation showed that the default settings of the networking stack had to be increased. I used sysctl to tweak the settings until no more overruns occurred.
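For reference, these are the kinds of knobs we raised; the concrete values below are only illustrative, not our exact settings:

```conf
# /etc/sysctl.d/local-net.conf - illustrative values only
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

Applied with `sysctl -p /etc/sysctl.d/local-net.conf`.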

Unfortunately, the interfaces used for the NAS RAID still drop a lot of packets, but only on RX.

After searching (here, Google, MetaGer, Intel, anywhere, everywhere) we found reports that the Intel igb driver has some problems and that some manual work is needed.

Thus I downloaded the latest version (igb-4.2.16), compiled the module with LRO and multi-queue support, and installed the new module.

All 20 (!) interfaces using this driver now have 8 RxTx queues (unpaired) and LRO enabled. The concrete options line is:

options igb InterruptThrottleRate=1 RSS=0 QueuePairs=0 LRO=1
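In case someone wants to reproduce this: a sketch of the modprobe configuration (the file path is the conventional Debian location; adjust as needed):

```conf
# /etc/modprobe.d/igb.conf
options igb InterruptThrottleRate=1 RSS=0 QueuePairs=0 LRO=1
```

After changing it, the module has to be reloaded (or, if igb is loaded from the initramfs, the initramfs rebuilt and the machine rebooted).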

irqbalance is nicely distributing the queues of all interfaces, and everything works splendidly.

So why am I writing? We have the following odd situation and simply cannot explain it:

Three of the five interfaces for the NAS RAID (we have added one spare NAS, and the RAID will be grown once mdadm has finished its current reshape) show a massive amount (millions!) of dropped packets.

Investigation with ethtool now shows, thanks to the new multi-queue-enabled driver, that each interface uses one queue heavily; we assume this is the reshape traffic.

But three interfaces use another queue with millions of incoming packets, which all get dropped. At least, investigation with 'watch' showed that the packet counts on these queues correlate with the dropped packets.

We changed the MTU on the NAS and the interfaces from 9000 down to 1500, but the packet drop rate increased and mdadm performance went down, so it does not look like an MTU problem. Furthermore, the network stack has insane amounts of memory at its disposal, so that shouldn't be a problem either. The backlogs are large enough (huge, in fact), and we are completely at sea.

Example output:

~ # for nr in 2 3 4 5 9 ; do eth="eth1${nr}" ; echo " ==== $eth ==== " ; ethtool -S $eth | \
> grep rx_queue_._packet | grep -v " 0" ; ifconfig $eth | grep RX | grep dropped ; \
> echo "--------------" ; done
==== eth12 ==== 
    rx_queue_0_packets: 114398096
    rx_queue_2_packets: 189529879
          RX packets:303928333 errors:0 dropped:114398375 overruns:0 frame:0
--------------
==== eth13 ==== 
    rx_queue_0_packets: 103341085
    rx_queue_1_packets: 163657597
    rx_queue_5_packets: 52
          RX packets:266998983 errors:0 dropped:103341256 overruns:0 frame:0
--------------
==== eth14 ==== 
    rx_queue_0_packets: 106369905
    rx_queue_4_packets: 164375748
          RX packets:270745915 errors:0 dropped:106369904 overruns:0 frame:0
--------------
==== eth15 ==== 
    rx_queue_0_packets: 161710572
    rx_queue_1_packets: 10
    rx_queue_2_packets: 10
    rx_queue_3_packets: 23
    rx_queue_4_packets: 10
    rx_queue_5_packets: 9
    rx_queue_6_packets: 81
    rx_queue_7_packets: 15
          RX packets:161710730 errors:0 dropped:4504 overruns:0 frame:0
--------------
==== eth19 ==== 
    rx_queue_0_packets: 1
    rx_queue_4_packets: 3687
    rx_queue_7_packets: 32
          RX packets:3720 errors:0 dropped:0 overruns:0 frame:0
--------------

The new spare drive is attached to eth15.
As you can see, there are no overruns and no errors, and the adapters report that they did not drop a single packet. Thus it is the kernel throwing the data away. But why?
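To illustrate the correlation: a small sketch (a hypothetical parser, with eth12's numbers from above hard-coded) that finds the RX queue whose counter tracks the drop counter:

```python
import re

# Excerpt of `ethtool -S eth12` output from above
ethtool_output = """\
rx_queue_0_packets: 114398096
rx_queue_2_packets: 189529879
"""
dropped = 114398375  # the "dropped" field from ifconfig for eth12

# Map queue number -> packet count
queues = {
    int(m.group(1)): int(m.group(2))
    for m in re.finditer(r"rx_queue_(\d+)_packets:\s*(\d+)", ethtool_output)
}

# Find the queue whose packet counter is closest to the drop counter
closest = min(queues, key=lambda q: abs(queues[q] - dropped))
print(closest, abs(queues[closest] - dropped))
# → 0 279
```

Queue 0 is within a few hundred packets of the drop counter, so practically everything arriving on that queue gets dropped.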

Edit: I forgot to mention that eth12 to eth15 are all located on the same card, eth19 on another.

Has anybody ever witnessed such strange behaviour, and was there a solution to remedy the situation?

And even if not: does anybody know a method by which we could at least find out which process occupies the dropping queues?

Thank you very much in advance!

Yamakuzure
    This is a terrible setup. But an interesting technical question. – ewwhite Jun 08 '13 at 12:09
  • What exactly is terrible? I have found plenty of how-tos and tutorials on setting up a NAS RAID using iSCSI. At least it does not seem to be that uncommon, does it? – Yamakuzure Jun 08 '13 at 20:27
  • This is definitely an uncommon setup. You'd be better served by building/buying an appropriately-sized storage unit with the right types of disks/interconnects and capacity. – ewwhite Jun 08 '13 at 21:37
  • @Yamakuzure what's uncommon is your use of numerous interfaces and direct connections instead of (redundant) switches. Also, using "dumb" storage devices to provide storage space via iSCSI to build an md array is surely technically possible, but typically an approach involving an "intelligent" storage providing resilience and sufficient space under a unified management interface would be preferred. I see how you are trying to create a low-cost SAN, but you likely are maneuvering yourself into more trouble than you can imagine. – the-wabbit Jun 09 '13 at 00:23
  • @ewwhite: Unfortunately, an "appropriately-sized storage unit" that can replace 23 5.2 TB NAS filers does not exist, imho. However, we have those filers now and simply want to consolidate them. – Yamakuzure Jun 10 '13 at 07:49
  • @syneticon-dj We have already discussed the change to redundant switches. The question is whether the performance impact is acceptable when drastically lowering the number of 1 GB links. Could you please explain what you mean by "dumb storage devices"? All filers are set up using RAID5, which has already helped us when a hard drive failed. Currently the setup allows us to keep on working even if one hard drive on each NAS fails, and it would even keep working if two NAS failed completely. (The rebuild/reshape time is rather unholy, of course.) – Yamakuzure Jun 10 '13 at 07:56
  • From what I know about the N5200, it runs a low-end old Celeron CPU and reaches sustained transfer rates of about 20 MB/s max (for reading) - you likely would not miss all that much in performance, even when running 5 of them on a single GB link. But I actually meant to advise you to run switches with 10 GB interfaces. Also, 23x5.2 TB sounds like a lot at first, but if you look at 60-bay 4U JBODs like http://www.dataonstorage.com/dataon-products/6g-sas-jbod/dns-1660-4u-60-bay-6g-35inch-sassata-jbod.html, you quickly realize that this could easily be accomplished in much less rack space. – the-wabbit Jun 10 '13 at 09:36
  • @syneticon-dj That storage looks nice, thank you for the hint! – Yamakuzure Jun 10 '13 at 13:11
  • About the N5200 performance: it can do over 50 MB/s writing with concurrent ~90 MB/s reading in "Overlapped I/O". [Article@hardwaresecrets.com](http://www.hardwaresecrets.com/article/Thecus-N5200-NAS-Review/606) - at least in lab conditions. However, in my own experience it is a lot faster than 20 MB/s. (I just rsync'd a 1.8 GB file from one of the filers to somewhere else, and while both the server and the NAS are anything but idle, rsync reported an average of 43 MB/s. So I guess your assumption of 20 MB/s is a bit low.) – Yamakuzure Jun 10 '13 at 13:34

1 Answer


You have enough interfaces to build a workgroup switch with. As this configuration is not employed often and thus not tested as thoroughly, expect oddities coming from that alone.

Also, as your setup is quite complex, you should try isolating the issue by simplifying it. This is what I would do:

  1. rule out the simple cases, e.g. by checking the link stats with /sbin/ethtool -S <interface> to see if the drops are a link-related problem
  2. as the NICs are making use of interrupt coalescing, increase the ring buffer and see if it helps matters
  3. use dropwatch to get a better idea if any other buffers could be increased
  4. disable multiqueue networking again - with 20 active interfaces there will hardly be a situation where multiple queues per interface gain any performance, and from your description it might be a queuing-related problem
  5. reduce the number of interfaces and see if the problem persists
  6. if nothing else helps, post a question to the Kernel netdev mailing list
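The first three steps might look like this on the command line (eth12 stands in for an affected interface; ring sizes depend on the NIC, so check the reported maximums first):

```shell
# 1. check the per-link statistics for anomalies
/sbin/ethtool -S eth12

# 2. inspect the current/maximum ring buffer sizes, then enlarge the RX ring
ethtool -g eth12
ethtool -G eth12 rx 4096

# 3. watch where the kernel drops packets; type 'start' at the prompt
dropwatch -l kas
```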
the-wabbit
  • Thanks. I wasn't aware of `dropwatch`. – ewwhite Jun 08 '13 at 12:08
  • To 1: We have observed the link stats using `watch` on ethtool. Everything is stable. 2: `watch -g` did not show anything suspicious. But I'll keep that in mind and will try what happens after enlarging the ring buffer. 3: I'll give `dropwatch` a try, thank you very much for that idea! 4: Well, we had the drop rates with single queues, too. 5: Well, almost all interfaces are currently used (and needed!); only two are spare. 6: I bet your ideas 2 and 3 will bring me towards finding a solution. (I didn't know `dropwatch` either!) If not, I'll try that. Thank you very much! – Yamakuzure Jun 08 '13 at 20:42
  • The issue is solved! The dropped packets are simply rx_flow_control_xon and xoff packets. Three of the filers simply aren't fast enough. And that is the reason why the "drop rate" went up when we lowered the MTU to 1500. Flow control works as intended. Why the driver lets the FC packets move up so they get dropped (and counted) is a riddle to me, but at least nothing bad is happening. (`dropwatch` was what gave the final clue.) – Yamakuzure Jun 10 '13 at 13:15
  • @Yamakuzure glad it could be resolved. The only thing that puzzles me is that Ethernet does not have a notion of XON and XOFF for flow control - it just uses PAUSE frames which are destined for a multicast address. It might be though that the internal Kernel structures mimic XON/XOFF behavior for compatibility with other transmission protocols. As the PAUSE frames are addressed to a valid multicast destination, it does not seem wrong to see them counted. As the destination address is not owned by the interface, it does not seem wrong to see them dropped, although I can see how it is confusing. – the-wabbit Jun 10 '13 at 15:01
  • @Yamakuzure BTW: I *told* you the N5200 were dog-slow ;) – the-wabbit Jun 10 '13 at 15:03