
We have 2 sites linked together with a VPN tunnel (Fortigate 60C devices). At each site I have an ESXi server with a couple of VMs. Normally, everything works fine.

Site 1 (S1) subnet is 192.168.254.0/24, with machines A1 and A2 on ESXi1
Site 2 (S2) subnet is 192.168.253.0/24, with machines B1 and B2 on ESXi2

All pings between those machines normally work through the VPN tunnel.

Suddenly, S1-A1 cannot ping S2-B1 anymore, but S2-B1 can still ping S1-A1.

All pings (using IP addresses) across all machines (VMs and ESXi hosts) work, except S1-A1 -> S2-B1.

Traceroute results were:
S1-A1 -> S2-B1: through the Internet (?????)
S1-A1 -> S2-B2: through the VPN tunnel
S2-B2 -> S1-A1: through the VPN tunnel
S1-A1 -> S2-ESXi2: through the VPN tunnel

Machine A1 is a Windows 2003 R2 SP2 server. There are 5 IP addresses bound to its NIC. I tried to disable and re-enable the NIC, but the network management stopped responding; only a reboot fixed the problem.

`route print` did not change: the gateway is the same and there is no specific route to reach B1.

`arp -a` did not show anything related to 192.168.253.0/24.
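
For reference, these are the kinds of checks that can be rerun on A1 the next time it happens (a sketch; the addresses are the ones used in this post):

    rem Any host route added behind the scenes (e.g. by an ICMP redirect)
    rem for the remote subnet would show up here:
    route print 192.168.253.*

    rem Check which MAC address is cached for the default gateway (the Fortigate):
    arp -a | findstr 192.168.254.254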

I don't understand why S1-A1 -> S2-ESXi2 worked but S1-A1 -> S2-B1 did not, since B1 (192.168.253.18) is running on ESXi2 (192.168.253.23).

Registry export of the network interface:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{0E114693-5FC8-4AA4-AB98-14CE43E24DE5}]
"UseZeroBroadcast"=dword:00000000
"EnableDeadGWDetect"=dword:00000001
"EnableDHCP"=dword:00000000
"IPAddress"=hex(7):31,00,39,00,32,00,2e,00,31,00,36,00,38,00,2e,00,32,00,35,00,\
  34,00,2e,00,31,00,35,00,00,00,31,00,39,00,32,00,2e,00,31,00,36,00,38,00,2e,\
  00,32,00,35,00,34,00,2e,00,31,00,32,00,00,00,31,00,39,00,32,00,2e,00,31,00,\
  36,00,38,00,2e,00,32,00,35,00,34,00,2e,00,31,00,33,00,00,00,31,00,39,00,32,\
  00,2e,00,31,00,36,00,38,00,2e,00,32,00,35,00,34,00,2e,00,31,00,35,00,31,00,\
  00,00,31,00,39,00,32,00,2e,00,31,00,36,00,38,00,2e,00,32,00,35,00,34,00,2e,\
  00,34,00,30,00,00,00,00,00

  which is 192.168.254.15 192.168.254.12 192.168.254.13 192.168.254.151 192.168.254.40

"SubnetMask"=hex(7):32,00,35,00,35,00,2e,00,32,00,35,00,35,00,2e,00,32,00,35,\
  00,35,00,2e,00,30,00,00,00,32,00,35,00,35,00,2e,00,32,00,35,00,35,00,2e,00,\
  32,00,35,00,35,00,2e,00,30,00,00,00,32,00,35,00,35,00,2e,00,32,00,35,00,35,\
  00,2e,00,32,00,35,00,35,00,2e,00,30,00,00,00,32,00,35,00,35,00,2e,00,32,00,\
  35,00,35,00,2e,00,32,00,35,00,35,00,2e,00,30,00,00,00,32,00,35,00,35,00,2e,\
  00,32,00,35,00,35,00,2e,00,32,00,35,00,35,00,2e,00,30,00,00,00,00,00

  which is 255.255.255.0 255.255.255.0  255.255.255.0  255.255.255.0 255.255.255.0

"DefaultGateway"=hex(7):31,00,39,00,32,00,2e,00,31,00,36,00,38,00,2e,00,32,00,\
  35,00,34,00,2e,00,32,00,35,00,34,00,00,00,00,00
  which is 192.168.254.254


"DefaultGatewayMetric"=hex(7):30,00,00,00,00,00
"NameServer"="192.168.254.254"
"Domain"=""
"RegistrationEnabled"=dword:00000001
"RegisterAdapterName"=dword:00000000
"TCPAllowedPorts"=hex(7):30,00,00,00,00,00
"UDPAllowedPorts"=hex(7):30,00,00,00,00,00
"RawIPAllowedProtocols"=hex(7):30,00,00,00,00,00
"NTEContextList"=hex(7):00,00
"DhcpClassIdBin"=hex:
"DhcpServer"="255.255.255.255"
"Lease"=dword:00000e10
"LeaseObtainedTime"=dword:51185713
"T1"=dword:51185e1b
"T2"=dword:51186361
"LeaseTerminatesTime"=dword:51186523
"IPAutoconfigurationAddress"="0.0.0.0"
"IPAutoconfigurationMask"="255.255.0.0"
"IPAutoconfigurationSeed"=dword:00000000
"AddressType"=dword:00000000

I initially ruled out the Fortigates as part of the problem, since rebooting A1 was all that was needed.

2013-09-19: The issue happened again. It seems to occur every time the VPN drops between the Fortigates.
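
When that happens, the tunnel state can be confirmed from the Fortigate CLI (a sketch using standard FortiOS commands; output format varies by firmware version):

    diagnose vpn ike gateway list    # Phase 1 (IKE) status
    diagnose vpn tunnel list         # Phase 2 (IPsec SA) status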

HOCHELAGA_2 # get router info routing-table all
Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP
       O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       * - candidate default

S*      0.0.0.0/0 [10/0] via 64.15.130.49, wan1
C       10.10.10.0/24 is directly connected, dmz
C       10.100.254.1/32 is directly connected, fat
C       10.100.254.2/32 is directly connected, fat
C       64.15.130.48/28 is directly connected, wan1
                        is directly connected, wan1
                        is directly connected, wan1
                        is directly connected, wan1
                        is directly connected, wan1
                        is directly connected, wan1
S       192.168.200.0/24 [10/0] via 10.100.254.2, fat
C       192.168.250.0/24 is directly connected, internal
S       192.168.252.0/24 [10/0] is directly connected, hoch st-bruno
S       192.168.253.0/24 [10/0] is directly connected, HOCH-KAN
C       192.168.254.0/24 is directly connected, internal
                         is directly connected, internal
                         is directly connected, internal


HOCHELAGA_2 # diagnose ip route list
tab=254 vf=0 scope=253 type=1 proto=2 prio=0 0.0.0.0/0.0.0.0/0->10.100.254.2/32 pref=10.100.254.1 gwy=0.0.0.0 dev=11(fat)
tab=254 vf=0 scope=253 type=1 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.48/28 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=254 vf=0 scope=253 type=1 proto=2 prio=0 0.0.0.0/0.0.0.0/0->169.254.0.64/26 pref=169.254.0.66 gwy=0.0.0.0 dev=16(havdlink1)
tab=254 vf=0 scope=0 type=1 proto=11 prio=0 0.0.0.0/0.0.0.0/0->192.168.200.0/24 pref=0.0.0.0 gwy=10.100.254.2 dev=11(fat)
tab=254 vf=0 scope=253 type=1 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.250.0/24 pref=192.168.250.254 gwy=0.0.0.0 dev=5(internal)
tab=254 vf=0 scope=253 type=1 proto=2 prio=0 0.0.0.0/0.0.0.0/0->10.10.10.0/24 pref=10.10.10.1 gwy=0.0.0.0 dev=4(dmz)
tab=254 vf=0 scope=0 type=1 proto=11 prio=0 0.0.0.0/0.0.0.0/0->192.168.252.0/24 pref=0.0.0.0 gwy=0.0.0.0 dev=9(hoch st-bruno)
tab=254 vf=0 scope=0 type=1 proto=11 prio=0 0.0.0.0/0.0.0.0/0->192.168.253.0/24 pref=0.0.0.0 gwy=0.0.0.0 dev=10(HOCH-KAN)
tab=254 vf=0 scope=253 type=1 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.254.0/24 pref=192.168.254.254 gwy=0.0.0.0 dev=5(internal)
tab=254 vf=0 scope=0 type=1 proto=11 prio=0 0.0.0.0/0.0.0.0/0->0.0.0.0/0 pref=0.0.0.0 gwy=64.15.130.49 dev=3(wan1)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.63/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->127.255.255.255/32 pref=127.0.0.1 gwy=0.0.0.0 dev=7(root)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->10.10.10.1/32 pref=10.10.10.1 gwy=0.0.0.0 dev=4(dmz)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->10.10.10.0/32 pref=10.10.10.1 gwy=0.0.0.0 dev=4(dmz)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->10.100.254.1/32 pref=10.100.254.1 gwy=0.0.0.0 dev=11(fat)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.59/32 pref=64.15.130.59 gwy=0.0.0.0 dev=2(wan2)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.254.2/32 pref=192.168.254.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.58/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->169.254.0.66/32 pref=169.254.0.66 gwy=0.0.0.0 dev=16(havdlink1)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.250.0/32 pref=192.168.250.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.254.1/32 pref=192.168.254.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.57/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.254.0/32 pref=192.168.254.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.56/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->169.254.0.64/32 pref=169.254.0.66 gwy=0.0.0.0 dev=16(havdlink1)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.54/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->169.254.0.127/32 pref=169.254.0.66 gwy=0.0.0.0 dev=16(havdlink1)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.53/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.52/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->10.10.10.255/32 pref=10.10.10.1 gwy=0.0.0.0 dev=4(dmz)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->127.0.0.0/32 pref=127.0.0.1 gwy=0.0.0.0 dev=7(root)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.254.254/32 pref=192.168.254.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.250.255/32 pref=192.168.250.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->127.0.0.1/32 pref=127.0.0.1 gwy=0.0.0.0 dev=7(root)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->64.15.130.48/32 pref=64.15.130.56 gwy=0.0.0.0 dev=3(wan1)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.250.254/32 pref=192.168.250.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=253 type=3 proto=2 prio=0 0.0.0.0/0.0.0.0/0->192.168.254.255/32 pref=192.168.254.254 gwy=0.0.0.0 dev=5(internal)
tab=255 vf=0 scope=254 type=2 proto=2 prio=0 0.0.0.0/0.0.0.0/0->127.0.0.0/8 pref=127.0.0.1 gwy=0.0.0.0 dev=7(root)
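
The kernel route cache can diverge from the routing table, so it may be worth dumping it as well next time (a sketch; this command is available on this FortiOS generation):

    diagnose ip rtcache list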

PING successful to one server

diagnose sniffer packet any "host 192.168.253.23" 4

23.232067 internal in 192.168.254.15 -> 192.168.253.23: icmp: echo request
23.232329 HOCH-KAN out 192.168.254.15 -> 192.168.253.23: icmp: echo request
23.248800 HOCH-KAN in 192.168.253.23 -> 192.168.254.15: icmp: echo reply
23.248932 internal out 192.168.253.23 -> 192.168.254.15: icmp: echo reply

PING failed to another server

diagnose sniffer packet any "host 192.168.253.18" 4

8.212249 internal in 192.168.254.15 -> 192.168.253.18: icmp: echo request
8.212479 wan1 out 64.15.130.56 -> 192.168.253.18: icmp: echo request
10.508155 internal in 192.168.254.15.1113 -> 192.168.253.18.139: syn 1271941747
10.508436 wan1 out 64.15.130.56.42334 -> 192.168.253.18.139: syn 1271941747
11.706287 internal in 192.168.254.15.1112 -> 192.168.253.18.445: syn 341420858
11.706540 wan1 out 64.15.130.56.42332 -> 192.168.253.18.445: syn 341420858

Why is the route taken different for servers on the same network? I don't use any RIP, OSPF, or BGP routing, and there is no policy routing; just a static route between the VPNs. Nothing shows a dynamic route for 192.168.253.18, and yet the Fortigate decides to route it out the wan1 interface instead.

Anything I could check next time it happens?
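
For example (a sketch of one candidate check, using the addresses above): the Fortigate session table for the failing host, since an established session can keep steering traffic regardless of what the routing table says:

    diagnose sys session filter dst 192.168.253.18
    diagnose sys session list    # inspect the gwy= and dev= fields of the matching sessions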

Thanks in advance.
And sorry if it is not fully clear; French is my mother tongue.
S.

sbrisson
  • "Anything I could check next time it happens ?" You are saying that a reboot fixed it completely and you can ping S1-A1 to S2-B1 again fine? – TheCleaner Sep 09 '13 at 14:38
  • Yes. The reboot fixed it completely. I have real-time file replication from A1 to B1 that has been running for a couple of months. I was alerted by my monitoring system running on B1 on Saturday morning that the files were possibly too old (meaning that the replication was not occurring). – sbrisson Sep 09 '13 at 14:51
  • It's obvious from your tracert results what caused the problem: A's connection to B went through the internet instead of the VPN. The next time this happens, look for the same condition and then troubleshoot that. Also, connectivity to the host doesn't imply connectivity to the guest; host network connectivity doesn't automatically confer guest network connectivity. – joeqwerty Sep 09 '13 at 15:27
  • @joeqwerty, that's exactly what I thought... a route problem. But why is 192.168.253.18 taking another route? Everything is routed to the gateway (the Fortigate), so the problem *can be* the Fortigate route table (which I verified, and it did not change). But what puzzles me is that a reboot of the machine fixed it! So I suspect a Windows networking issue. – sbrisson Sep 09 '13 at 16:00
  • If your Windows box gets an ICMP unreachable for a route, it adds a static route for that single IP to the routing table. It's called dead gateway detection, and it is a setting that can be controlled. Do you have policy-based or route-based (interface) VPNs? – brandeded Sep 10 '13 at 11:46
  • @mbrownnyc, I included the registry export of the interface in the post. As you can see, dead gateway detection is ON, but I have only 1 gateway defined. If it adds a static route for the single IP, will `route print` show it? Because while I was having the issue, the routes were fine: no single-IP route was showing in the `route print` output. – sbrisson Sep 10 '13 at 14:20
  • Salut! Yes, `route print` will show the dead gateway magic. I asked about interface versus policy because I've seen strange things happen with interface-based VPNs... like, I've changed attributes via the CLI and they were not reflected when the tunnel was torn down and brought back up... you must restart the host process. I've not tried the same with policy-based VPNs. It would be useful to mention whether the VPNs are route-based or policy-based. – brandeded Sep 10 '13 at 20:36
  • Also have you attempted to debug the policies? – brandeded Sep 10 '13 at 20:37
  • @mbrownnyc, on the Fortigates themselves, I'm pretty sure it's an interface-based VPN (IPsec with Phase1/Phase2 stuff), and there is a static route between the 2 sites (like this: http://docs.forticare.com/fos50hlp/50/index.html#page/FortiOS%25205.0%2520Help/gw-to-gw.082.09.html). We have not changed the Fortigate config for a while. Also, am I right to say that if it was a problem at the Fortigate level, all machines in Site 1 would be unable to ping S2-B1 (not just A1)? – sbrisson Sep 10 '13 at 21:55
  • At this point, if the packets traverse 1) from the client to the Fortigate's internal interface (via a route on the client), 2) into the Fortigate and into a VPN tunnel carrying the subnet, and 3) to the destination, then there should be no problem. My strong suggestion is to debug/packet-sniff on the Fortigate... also look at the ICMP packets received by the client. Do you need these commands? – brandeded Sep 10 '13 at 22:03
  • @mbrownnyc, yes, it would be greatly appreciated if you could give them to me. I will keep them close in case the problem arises again. – sbrisson Sep 10 '13 at 23:15
  • Here you go: http://mbrownnyc.wordpress.com/2009/03/11/fortigate-debugging-on-a-fortigate/ (a sketch of those commands is included after these comments). – brandeded Sep 11 '13 at 00:14
  • I have exactly the same problem. Suddenly one Windows PC on ESXi 5.5 (the mgmt and probing PC) sends requests and they leave via the internet instead of via the route-based IPsec VPN. All the other machines (on the ESXi server) and some physical machines still show the correct behavior (= they go through the IPsec tunnel for exactly the same destinations as tested on that one PC). Very strange. –  Mar 22 '14 at 08:45
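
For reference, the kind of Fortigate flow debugging suggested in the comments above looks like this (a sketch, assuming the FortiOS 4.x/5.x syntax of that era; the address is this thread's failing host):

    diagnose debug flow filter addr 192.168.253.18
    diagnose debug flow show console enable
    diagnose debug enable
    diagnose debug flow trace start 100    # trace the next 100 matching packets

For each matching packet this prints the route and session decision the unit makes, which would show whether the echo request is matched to an existing (stale) session or to a fresh route lookup.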

2 Answers


We finally found the reason. It is related to a Fortigate ICMP session timeout problem. When the VPN is down, the ICMP session is marked to go out the interface directly rather than through the VPN tunnel. However, when the VPN recovers, the session's path is not re-evaluated until its remaining lifetime reaches zero, and if you keep pinging, the lifetime never reaches zero.
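
Based on that explanation, flushing the stale session should fix it without a reboot (a sketch using the standard FortiOS session commands and this thread's failing host; proto 1 is ICMP):

    diagnose sys session filter proto 1
    diagnose sys session filter dst 192.168.253.18
    diagnose sys session clear    # clears only the sessions matching the filter

The next echo request then creates a fresh session, which gets routed through the re-established tunnel.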

frank

Add a blackhole route for the remote subnet with a lower priority, so that when the tunnel is down the packets won't follow the default route out the WAN interface; instead they hit the blackhole route and are dropped there. When the tunnel is up again, traffic resumes over the higher-priority tunnel route.
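
A minimal sketch of such a route on a FortiGate, assuming the remote subnet from this question and a free route ID; giving it a distance of 254 keeps it less preferred than the tunnel route (the # annotations are not part of the CLI):

    config router static
        edit 0                                    # 0 = auto-assign the next free route ID
            set dst 192.168.253.0 255.255.255.0   # remote VPN subnet
            set blackhole enable                  # silently drop matching packets
            set distance 254                      # only wins once the tunnel route disappears
        next
    end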

sekar
  • It seems logical. But my problem has vanished since the upgrade of the Fortigates. I still have VPN disconnects sometimes, but the routes recover on their own. – sbrisson Sep 06 '16 at 20:22