
I'm having a strange problem with a back-to-back connection between an ESX 6 host and a CentOS 7 machine.

We are using a CentOS 7 machine directly connected to the ESX host as an iSCSI NAS. From time to time ESX reports that it cannot see the NAS, and the corresponding datastore becomes unreachable. When this happens we check everything and find the following:

- Nothing is physically wrong: the LEDs on the NICs are on, and ethtool on both Linux and ESX reports that the link is OK.
- In the ARP tables, Linux knows the ESX interface, but ESX does not know the Linux one; its ARP cache shows the entry as incomplete.
- When we captured ARP/RARP packets with tcpdump, we saw something strange: Linux receives the ARP request from the ESX interface and tcpdump shows Linux replying to it, but tcpdump on ESX never shows the ARP reply that Linux sent.

Somehow the link seems to have become a one-way road!?

Please see the commands we ran, and their output, while searching for a clue:

On CentOS 7

[root@nas ~]# arp -an
? (10.10.10.2) at 00:50:56:XX:0d:77 [ether] on enp3s6
? (192.168.70.254) at 00:50:56:XX:99:c7 [ether] on enp5s0

[root@nas ~]# tcpdump -nnvli enp3s6 arp
tcpdump: listening on enp3s6, link-type EN10MB (Ethernet), capture size 65535 bytes
07:52:25.143360 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 46
07:52:25.143367 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.10.10.1 is-at 00:07:e9:XX:07:93, length 28
07:52:26.143452 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 46
07:52:26.143454 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.10.10.1 is-at 00:07:e9:XX:07:93, length 28
07:52:27.145667 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 46
07:52:27.145673 ARP, Ethernet (len 6), IPv4 (len 4), Reply 10.10.10.1 is-at 00:07:e9:XX:07:93, length 28

On ESX 6

[root@gahar:~] tcpdump-uw  -nnvli vmk2 arp 
tcpdump-uw: listening on vmk2, link-type EN10MB (Ethernet), capture size 96 bytes
07:52:25.523005 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 28
07:52:26.523247 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 28
07:52:27.524461 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 28
07:52:31.079580 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 28
07:52:31.079634 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 28
07:52:32.080746 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 28
07:52:33.081656 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.10.10.1 tell 10.10.10.2, length 28
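
To make the comparison between the two captures less error-prone, the requests and replies in each one could be counted mechanically. This is only a rough sketch: it assumes the tcpdump text output has been saved to a file or is piped in, and the function name count_arp is made up for illustration.

```shell
# Count ARP requests and replies in tcpdump text output.
# Reads the files given as arguments, or stdin if none are given.
count_arp() {
  awk '
    /ARP, .*Request who-has/ { req++ }   # lines such as "... Request who-has 10.10.10.1 tell 10.10.10.2 ..."
    /ARP, .*Reply /          { rep++ }   # lines such as "... Reply 10.10.10.1 is-at ..."
    END { printf "requests=%d replies=%d\n", req, rep }
  ' "$@"
}
```

Fed the two captures shown above, this would report 3 requests and 3 replies on the CentOS side, and 7 requests and 0 replies on the ESX side.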

[root@gahar:~] ping 10.10.10.1
PING 10.10.10.1 (10.10.10.1): 56 data bytes
sendto() failed (Host is down)
[root@gahar:~] esxcli network ip neighbor list
Neighbor        Mac Address        Vmknic    Expiry  State  Type   
--------------  -----------------  ------  --------  -----  -------
192.168.33.10   00:0c:29:XX:ea:60  vmk0     965 sec         Unknown
192.168.33.254  00:50:56:XX:99:c7  vmk0    1194 sec         Unknown
10.10.10.1      (incomplete)       vmk2      -3 sec         Unknown
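
To catch the moment the entry goes incomplete, something like the following could be left running on the ESX host. It is only a sketch under assumptions: the function names and the 5-second default interval are made up for illustration, and it assumes the ESXi shell provides awk, date and sleep. The only ESX command it relies on is the `esxcli network ip neighbor list` call shown above.

```shell
# Succeeds (exit status 0) if the neighbor table read from stdin lists
# the given IP address with "(incomplete)" in the MAC address column.
is_incomplete() {
  awk -v ip="$1" '
    $1 == ip && $2 == "(incomplete)" { found = 1 }
    END { exit !found }
  '
}

# Poll the neighbor table and print a timestamped line whenever the
# entry for the given IP is incomplete. Usage: watch_neighbor IP [SECONDS]
watch_neighbor() {
  while true; do
    if esxcli network ip neighbor list | is_incomplete "$1"; then
      echo "$(date) neighbor $1 is incomplete"
    fi
    sleep "${2:-5}"
  done
}
```

For example, `watch_neighbor 10.10.10.1 10 >> /tmp/neighbor.log` would check every 10 seconds; the log path is arbitrary. The timestamps could then be lined up against events on the CentOS side.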

Temporary workaround:

[root@gahar:~] esxcli network nic down -n vmnic2
[root@gahar:~] esxcli network nic up -n vmnic2

[root@gahar:~] ping 10.10.10.1
PING 10.10.10.1 (10.10.10.1): 56 data bytes
64 bytes from 10.10.10.1: icmp_seq=0 ttl=64 time=0.207 ms
64 bytes from 10.10.10.1: icmp_seq=1 ttl=64 time=0.212 ms
64 bytes from 10.10.10.1: icmp_seq=2 ttl=64 time=0.257 ms

--- 10.10.10.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.207/0.225/0.257 ms

Given all of the above, I can't find the root cause and am looking for a solution.
