
We recently upgraded a remote site from a 10/10Mbps fibre link to a 20/20Mbps fibre link (it is fibre to the basement, then VDSL from the basement to the office, roughly 30 meters). There are regular large (multi-gig) file copies between this site and a central site, so the theory was that increasing the link to 20/20 should roughly halve the transfer times.

File-copy transfers (e.g. using robocopy in either direction, or Veeam Backup and Recovery's replication) are capped at 10Mbps.

Before upgrade:

[screenshot: bandwidth graph before upgrade]

After upgrade (robocopy):

[screenshot: bandwidth graph after upgrade]

Almost identical (ignore the difference in the duration of the transfer).

The transfers are being done over an IPSec tunnel between a Cisco ASA5520 and a Mikrotik RB2011UiAS-RM.

First thoughts:

  • QoS - nope. There are QoS rules but none that should affect this flow. I disabled all the rules for a few minutes to check anyway, and no change
  • Software-defined limits. Most of this traffic is Veeam Backup and Recovery shipping off-site, but there are no limits defined in there. Additionally, I just did a straight robocopy and saw exactly the same statistics.
  • Hardware not capable. Well, a 5520's published performance figures are 225Mbps of 3DES data, and the Mikrotik doesn't publish numbers, but it would be well over 10Mbps. The Mikrotik is at around 25%-33% CPU usage when doing these transfer tests. (Also, doing an HTTP transfer over the IPSec tunnel does hit close to 20Mbps)
  • Latency combined with TCP window size? Well, it's 15ms latency between the sites, so even a worst-case 32KB window still allows 32KB / 0.015s ≈ 2.1MB/sec. Additionally, multiple concurrent transfers still just add up to 10Mbps, which doesn't support this theory (see the sketch after this list).
  • Maybe the source and destination are both shit? Well the source can push 1.6GB/sec sustained sequential reads, so it's not that. The destination can do 200MB/sec sustained sequential writes, so it's not that either.
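
For a back-of-the-envelope check of that window-size theory, the arithmetic looks like this (an illustrative sketch of the calculation above, not part of any tooling from the actual tests):

```python
# Single-flow TCP throughput ceiling = window size / round-trip time.
# Figures below are the ones from the bullet point above.

def max_tcp_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP flow limited purely by window size."""
    return window_bytes / rtt_seconds * 8 / 1_000_000

window = 32 * 1024  # worst-case 32KB window
rtt = 0.015         # 15ms between the sites

print(f"{max_tcp_throughput_mbps(window, rtt):.1f} Mbps")
# ~17.5 Mbps (about 2.1MB/sec) -- above the observed 10Mbps cap,
# so window size alone cannot be the ceiling.
```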

This is a very odd situation. I've never seen anything manifest quite in this manner before.

Where else can I look?


On further investigation, I'm confident in pointing to the IPSec tunnel as the problem. I made a contrived example and ran some tests directly between two public IP addresses at the sites, then ran the exact same test using the internal IP addresses. I was able to replicate 20Mbps over the unencrypted internet, but only 10Mbps on the IPSec side.
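
For reference, the contrived test amounted to pushing bulk data over a raw TCP connection between the two endpoints and timing it. A minimal sketch of that kind of probe (the port and payload size here are arbitrary placeholders, not the values from the actual test):

```python
# Crude raw-TCP throughput probe: run "recv" mode on one site,
# then point the sender at it from the other site.
import socket
import sys
import time

PORT = 5001                  # placeholder port
CHUNK = 64 * 1024
TOTAL = 100 * 1024 * 1024    # 100MB test payload

def receiver() -> None:
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while conn.recv(CHUNK):
                pass  # drain and discard

def sender(host: str) -> None:
    payload = b"\x00" * CHUNK
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, PORT)) as s:
        while sent < TOTAL:
            s.sendall(payload)
            sent += len(payload)
        elapsed = time.monotonic() - start
    print(f"{sent * 8 / elapsed / 1_000_000:.1f} Mbps")

if __name__ == "__main__":
    receiver() if sys.argv[1] == "recv" else sender(sys.argv[1])
```

Running the sender against a site's public IP and then against its tunnel-side internal IP gives the encrypted/unencrypted comparison directly.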


A previous version of this question had a red herring about HTTP. Forget about that; it was a faulty testing mechanism.

As per the suggestion from Xeon, and echoed by my ISP when I asked them for support, I have set up a mangle rule to clamp the MSS for the IPSec traffic to 1422 so that the encrypted packet fits inside the ISP's 1480-byte MTU, based on this calculation:

 1422 (payload) + 20 (new IP header) + 4 (SPI) + 4 (ESP sequence)
   + 16 (ESP-AES IV) + 0 (ESP padding) + 1 (pad length)
   + 1 (next header) + 12 (ESP-SHA) = 1480

But alas, this has made no effective difference.
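
As a sanity check, here is the same arithmetic in a few lines (field sizes are the ones from the breakdown above; they vary with the cipher/auth suite):

```python
# ESP overhead per packet, using the field sizes from the calculation above.
ESP_OVERHEAD = (
    20      # new outer IP header
    + 4     # ESP SPI
    + 4     # ESP sequence number
    + 16    # ESP-AES IV
    + 0     # ESP padding (best case)
    + 1     # pad length
    + 1     # next header
    + 12    # ESP-SHA1 ICV
)

LINK_MTU = 1480
inner = LINK_MTU - ESP_OVERHEAD
print(inner)        # 1422 -- room left for the whole inner IP packet

# Caveat: MSS counts TCP payload only, so the inner IP (20) and TCP (20)
# headers also have to fit in that 1422, suggesting an MSS closer to
# 1382 -- which may be why the 1380 seen in the captures below works.
print(inner - 40)   # 1382
```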


After comparing Wireshark captures, the TCP session now negotiates an MSS of 1380 at both ends (after tweaking a few things and adding a buffer in case my maths sucks. Hint: it probably does). 1380 is also the ASA's default MSS, so it may have been negotiating this the whole time.
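
For anyone wanting to repeat that check, here is a minimal scapy sketch that pulls the negotiated MSS out of the SYN packets in a capture (the filename is a placeholder; requires `pip install scapy`):

```python
# Print the MSS option from every SYN / SYN-ACK in a capture file.
from scapy.all import IP, TCP, rdpcap

for pkt in rdpcap("transfer.pcap"):    # placeholder capture filename
    if IP in pkt and TCP in pkt and pkt[TCP].flags & 0x02:  # SYN bit set
        for name, value in pkt[TCP].options:
            if name == "MSS":
                print(f"{pkt[IP].src} -> {pkt[IP].dst}: MSS={value}")
```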


I'm seeing some strange data in the Mikrotik tool I've been using to measure the traffic. It could be nothing. I didn't notice this before as I was using a filtered query, and I only saw it when I removed the filter.

  • What do the MTUs look like? – xeon Mar 26 '15 at 03:35
  • Good point. It's 9000 on both the switches at either end, 1500 on the server and clients themselves, and 1480 on the VDSL portion of the link. That's the only portions of the links that I control. – Mark Henderson Mar 26 '15 at 03:45
  • `ping -t -f -l 1500 <destination>` (decrease by 20 after each failure); once you are around 1300 I bet it will work. This should indicate you need to adjust the MTU on the ASA/Mikrotik IPsec tunnels, or you might be able to set them so they don't drop fragments that are too large. – xeon Mar 26 '15 at 04:03
  • `1394` is the largest MTU that I was able to get through. – Mark Henderson Mar 26 '15 at 04:29
  • Your data is being fragmented, so reducing the MTU on the tunnel to 1350-1380 should help increase throughput. IPsec overhead is around 84 bytes (depending on your encapsulation etc) so 1480 - 84 = 1396, close to your max you saw. – xeon Mar 26 '15 at 04:56
  • Those results seem right accounting for the 8 byte ICMP header, the 20 byte IP header, the 73 byte IPsec overhead and the 5 byte VDSL overhead (if my numbers are correct) for a total MTU of 1500 bytes. What do the interface counters on the switches and router show? Any overruns, drops, PAUSE frames, etc.? – joeqwerty Mar 26 '15 at 04:59
  • @joeqwerty The devices at both ends of the tunnel show 0 errors, 0 pause, a handful of underruns and nothing particularly out of the ordinary anywhere else. – Mark Henderson Mar 26 '15 at 05:14
  • I thought you'd see something for sure. It's strange that you didn't have this problem before the upgrade and also that it doesn't appear to affect HTTP transfers. – joeqwerty Mar 26 '15 at 05:30
  • I wonder what you'd see if you backed off the MTU of the VDSL link by the value of the IPsec overhead? But why wasn't this a problem before? Weird. – joeqwerty Mar 26 '15 at 05:42
  • Cool. Cisco has an IPsec overhead calculator: https://cway.cisco.com/tools/ipsec-overhead-calc/ipsec-overhead-calc.html – joeqwerty Mar 26 '15 at 05:49
  • Have you tried copying between different _internal_ IP's? Sure that you don't have any bandwidth restrictions between the 2 internal endpoints? – MichelZ Mar 26 '15 at 06:01
  • @joeqwerty well the problem probably did occur before the upgrade, but because it was a 10/10 link it was by coincidence capping out at the maximum available bandwidth anyway. – Mark Henderson Mar 26 '15 at 08:17
  • @MichelZ I have. Tried lots of different source/destinations with the same result – Mark Henderson Mar 26 '15 at 08:17
  • Still weird that it is more or less exactly 10 MBit... so I would still suspect something artificial.. Are you using SMB2? – MichelZ Mar 26 '15 at 08:19
  • @joeqwerty that calculator is great. When I'm back in the office tomorrow I'll muck around and see what's what. Right now it looks like my total packet size is going to be 1536 bytes, which is too big. – Mark Henderson Mar 26 '15 at 08:20
  • @MichelZ yes. At least I damn well hope so - Server 2012 R2 and Windows 7 at either end. – Mark Henderson Mar 26 '15 at 08:20
  • Server 2012 R2 supports SMB 3.02, while Windows 7 only supports SMB 2. I've had cases where transfers between systems with differing *maximum* supported versions have very similar symptoms. Can you try doing a file transfer between systems with the same versions? – GregL Mar 26 '15 at 11:59
  • @GregL no luck I'm afraid, same story. *However* I have concluded this morning that my initial tests were faulty, and it does not seem to be isolated to SMB, as with a different (more accurate) test, I can now reproduce this issue over HTTP. I've removed the HTTP red herring from my question now. – Mark Henderson Mar 26 '15 at 21:19
  • Any Hypervisors in between? – xeon Mar 27 '15 at 23:20
  • @xeon bare metal host to bare metal host, no difference. – Mark Henderson Mar 30 '15 at 02:16

2 Answers


Even though CPU was the third thing I checked, and I wrote this:

The Mikrotik is at around 25%-33% CPU usage when doing these transfer tests

which is confirmed by the CPU graph:

[screenshot: Mikrotik CPU usage graph]

I've had it confirmed by external resources (i.e. a bunch of other support forums and blogs) that most Mikrotik routers just cannot push more than 11Mbps of IPSec traffic with either 3DES or AES encryption, unless you get a model that has hardware encryption offloading.

So it looks like this is just a hardware limitation. I should have caught it much earlier on, but for some reason the Mikrotik was not indicating to me that it was CPU-bound.

Off shopping I go.

  • I would be interested to know the specific limitation that is imposing this ceiling for IPSec traffic. Did any of your external sources explain it in more depth? – blacklight Mar 30 '15 at 03:34
  • Unfortunately not. I found some threads on the Mikrotik forums where 11Mbps was thrown around as the maximum for this router (and it seems like I have confirmed this here). In the blog I linked to, the guy ran his tests and got around 1Mbps of traffic, but on a much, much lower-powered router. Mine should be around 6-10x more powerful and I seem to be getting 6-10x the amount of IPSec traffic, which all matches up. It doesn't look like a CPU-bound issue, or an IRQ-bound issue, or a memory-bound issue. I have no idea what's actually going on here. – Mark Henderson Mar 30 '15 at 03:43

I can confirm that the culprit is the CPU. Here I benchmarked a Mikrotik RB750GL and I measured 12 Mb/s with AES-128 traffic (and only 6.0 Mb/s with 3DES).

Your result seems perfectly in line with what I recorded.

  • It looks like the extra 200MHz in speed between the 750 and the 2011 hasn't made any difference to the IPSec speeds. I wish Mikrotik would publish these figures somewhere. – Mark Henderson Mar 30 '15 at 23:43