2

There is a really weird issue on our internal network that constantly hits seemingly "random" machines. A client machine boots up and goes to some-intern-website-which-is-not-accessible-from-internet.company-name.tld: timeout. Retry: timeout. Ping the host: the machine is alive. Go to the website again: it loads successfully.

I've triple-checked the firewall rules on the gateway (but how can the gateway have anything to do with it when the switches connect the internal computers directly?) and checked the firewall on the webserver: it ACCEPTs everything, according to both iptables -S and iptables -L. tracert uses ping, so the host is alive and well whenever I try that. I'm out of ideas and don't know how to troubleshoot this. VPN users also seem to be affected. How do I approach this issue?

The webserver is a Debian domU Xen VM on XenServer. The problem started after we moved the company to a new location, threw away a few old switches and bought new ones. Nothing too fancy. Could it be the switches?

Edit:

Per request I have created a simple drawing of the network infrastructure:

enter image description here

And also pfSense and firewall configurations:

  • All sysctl values default
  • Hostname Portia
  • Domain EHV (same as windows 2012 AD)
  • DNS Servers 127.0.0.1, 8.8.8.8, 8.8.4.4, wan gateway
  • AD also as LDAP auth server configured into pfSense
  • EnableReflectionPureNat: YES
  • hadp mode
  • re0 - WAN, re1 - LAN (see image for ip+subnet)
  • WAN -> blockbogons only
  • DNS on LAN's (and VPN): Range from 10.x.0.1 to 10.x.255.255 with gateway 10.0.0.1 and wins server 10.0.0.2
  • x - LAN = 1, VPN = 2

FW Rules:

enter image description here enter image description here

No floating rules (0)

NAT Rules:

1:1 and NPt: empty

Outbound: default/auto

enter image description here

Aliases:

enter image description here

Restarting everything doesn't seem to help.

I acquired packet captures with tcpdump and Wireshark on both client and server. The client sends a SYN, the server tries to reply with a SYN-ACK, but the reply never reaches the client, and the server keeps retrying. The server also cannot ping the client. Once the client pings the server, the server can communicate with the client for an unknown amount of time (longer than 1 hour, or perhaps as long as the machine is up?). Some other PCs (particularly those behind other switches) cannot ping the client either.

Edit 2: This seems to be happening for all TCP connections (I just had the same with SSH).
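Since the symptom affects any TCP connection, it can be reproduced without a browser. The following is a minimal Python sketch of a connect probe, assuming nothing about the actual environment: the function name `probe_tcp` and its parameters are made up for illustration, and the host and port are whatever service is failing. It only records whether each connection attempt succeeded and how long it took.

```python
import socket
import time


def probe_tcp(host, port, attempts=3, timeout=3.0, pause=1.0):
    """Attempt a TCP connection several times.

    Returns one (succeeded, seconds_elapsed, error_text) tuple per attempt.
    """
    results = []
    for attempt in range(attempts):
        start = time.monotonic()
        try:
            # create_connection performs the full three-way handshake
            with socket.create_connection((host, port), timeout=timeout):
                results.append((True, time.monotonic() - start, None))
        except OSError as exc:  # covers timeouts and refused connections
            results.append((False, time.monotonic() - start, repr(exc)))
        if attempt < attempts - 1:
            time.sleep(pause)
    return results
```

Running something like `probe_tcp("<server address>", 22)` on a freshly booted client, before any ping, and again after a ping, would show whether the timeouts line up with the behaviour seen in the captures.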

Edit 3: I've managed to somehow isolate the issue, I think?

I changed the IP of a PC from 10.[not zero].x.x to 10.0.x.x and ... behold, it works! Why? Why can't machines on 10.1, 10.2, etc. talk to each other until they have pinged each other? How can I determine the culprit?
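One thing worth ruling out when 10.0.x.x works and 10.1.x.x does not is a disagreement about the subnet mask, because the mask decides whether a host treats a peer as a direct neighbour (resolved via ARP) or sends the traffic to its gateway. The sketch below uses Python's standard `ipaddress` module to show that decision. The addresses and prefix lengths are assumptions chosen for illustration; the real masks are only in the network drawing.

```python
import ipaddress


def on_link(local_ip, prefix_len, remote_ip):
    """Return True if remote_ip falls inside the network implied by
    local_ip and prefix_len, i.e. the host would ARP for it directly."""
    network = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    return ipaddress.ip_address(remote_ip) in network


# Hypothetical addresses: a client in 10.1.x.x and a server in 10.0.x.x.
client, server = "10.1.0.50", "10.0.0.2"

for prefix in (8, 16):
    print(f"/{prefix}: client sees server on-link: {on_link(client, prefix, server)}, "
          f"server sees client on-link: {on_link(server, prefix, client)}")
```

With a /8 mask on both sides the two hosts consider each other neighbours; with a /16 on either side that host considers the other off-link. If the two ends use different masks, they disagree about how to reach each other, which would be consistent with replies going missing in one direction.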

Gizmo
  • You need to provide a bit more information. Make model of switches, subnets, vlans, etc. – Drifter104 Nov 24 '15 at 22:35
  • Replicate the failure scenario and troubleshoot. Is the routing table correct? Is the next hop in the ARP cache? Does a SYN go out? Etcetera. – David Schwartz Nov 25 '15 at 11:20
  • It seems the server is trying to reply with a SYN-ACK: error, retransmission, loop. I've been collecting Wireshark/tcpdump captures for the past few hours and that's the only thing I've learned. Checked routing tables, configs and firewalls, and restarted the whole power grid in the building (of course powering off all devices first). – Gizmo Nov 25 '15 at 18:19
  • Alright added as much info as I can to the question. – Gizmo Nov 25 '15 at 19:37
  • Well, I figured it out, though I don't really want to accept a solution like "change your DHCP from 10.y.x.x to some smaller range in 10.0.y.x" :( That's not a real solution, just a workaround for the problem? Should I upgrade the switches to managed ones? All of them? Or do I blame some other device? The NIC on the XenServer machine? A pfSense bug? How do I determine this? – Gizmo Nov 25 '15 at 20:30
  • By trial and error. It's paramount to make one change at a time. It could be the switch; you need another switch, even a cheap one, to perform tests, or remove the switch altogether and see if the client works as expected. – atmosx Nov 25 '15 at 21:11
  • Yeah, atmosx is right, this really sounds like an ARP issue. But as all your switches are unmanaged you're totally blind, so trial and error to eliminate one switch at a time is your only option. – Nath Nov 25 '15 at 21:17
  • Network classes are dead, killed in 1993 by RFCs 1518 and 1519, which defined CIDR (_Classless_ Inter-Domain Routing). Modern networking doesn't use network classes. Please let them rest in peace. – Ron Maupin Jun 09 '17 at 22:08

2 Answers

2

We had a problem similar to this issue that we solved today. A website hosted on one of our VMs would only be accessible after we pinged it. We figured out that the VM's NIC was set so it could go to sleep to conserve power. We changed the setting, and the website stayed up, no more pinging it first to be able to access it. I hope this helps someone!

0

I had the same issue; it turned out my webserver had the wrong subnet mask configured. After correcting that, the issue was solved.
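As a rough way to check for this on a Linux server, the configured prefix length of each interface can be compared with the range the clients actually live in. The sketch below parses the one-line-per-address output of `ip -o -f inet addr show` and tests whether an expected client range fits inside the interface's network. The sample text, the helper names and the expected range `10.0.0.0/8` are assumptions for illustration, not values taken from the setup in the question.

```python
import ipaddress
import re


def interfaces_from_ip_output(text):
    """Extract (name, IPv4Interface) pairs from `ip -o -f inet addr show` output."""
    found = []
    for line in text.splitlines():
        match = re.match(r"\d+:\s+(\S+)\s+inet\s+(\S+)", line)
        if match:
            found.append((match.group(1), ipaddress.ip_interface(match.group(2))))
    return found


def covers(iface, expected_cidr):
    """True if every address in expected_cidr is on-link for this interface."""
    return ipaddress.ip_network(expected_cidr).subnet_of(iface.network)


# Hypothetical sample output; on a real host, paste the command's output here.
sample = """1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 10.0.0.5/16 brd 10.0.255.255 scope global eth0
"""

for name, iface in interfaces_from_ip_output(sample):
    print(name, iface, "covers 10.0.0.0/8:", covers(iface, "10.0.0.0/8"))
```

In this made-up sample, eth0 has a /16, so clients in 10.1.x.x would not be treated as neighbours. On a Debian system of that era the mask is typically set by the `netmask` line in /etc/network/interfaces, though the right value depends on how the network is meant to be laid out.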

  • Welcome to serverfault. Please include more details how to check if the subnet mask is wrong and how to fix it – rafalmag Nov 05 '22 at 18:34