
I'm a software engineer who has been trying to track down (and, if possible, solve) some strange local networking problems in a multi-server hosting environment for the past two weeks.

We bought three dedicated boxes with 32 GB RAM and 8-core i7 CPUs from a European hosting company. Each box has two interfaces: one for external traffic and one for local communication. We then hired a systems engineer to set up our initial environment. What a wonderful world. Everything went fine until the deployment. After deploying the application on the servers, the problems below started:

Server 1 (DB): 32 GB, 8 cores, 2 interfaces, running only two services: MySQL 5.5 and memcached 1.4.13-0ubuntu2 on Ubuntu 12.04 LTS

Server 2 (www): 32 GB, 8 cores, 2 interfaces, running php5-fpm (v5.5), nginx 1.4.4 and crontab on Ubuntu 12.04 LTS

Server 3 (Solr): 32 GB, 8 cores, 2 interfaces, running Tomcat 7 with Solr 4.5 and memcached 1.4.13-0ubuntu2 on Ubuntu 12.04 LTS

After deployment we noticed that our app's bulk indexing processes were extremely slow. During bulk indexing, the app reads data from the database (srv1; there is no end-user traffic in staging), processes it to produce more extended data, caches the new data on memcached (srv1) as multiple chunks, and indexes it in Solr. I spent more than 5-6 days on the application side looking for possible bottlenecks or app-related problems, but found nothing.

When running our indexing cron on the server, the application hangs and waits, sometimes throwing memcached-related connection errors (NOT FOUND); at other times it passes the reading phase successfully and then throws different connection exceptions related to the MySQL connection. The DB is up and running, with no error lines in mysql.log. Memcached is up and running with no errors logged, even with extremely verbose (-vvv) logging enabled. I checked the application again and again: no queries in loops (the queries are optimized enough), and no unnecessary memcached connections or operations in loops (we use multi_get / multi_set for bulk reading and writing).
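
While it hangs, one thing I can do from the shell is pull memcached's own counters and look for evictions or a saturated connection limit. A minimal probe, assuming srv1's local address is 10.10.10.1 (a placeholder for our real one); a growing listen_disabled_num would mean memcached's connection limit is being hit:

root@10.10.10.4 ~ # printf 'stats\r\nquit\r\n' | nc 10.10.10.1 11211 | egrep 'get_misses|evictions|curr_connections|listen_disabled_num'

If those counters look sane, the NOT FOUND errors would point at lost or timed-out requests on the wire rather than at memcached itself.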

Then I tried switching my application configuration to use our external IP addresses (120.144.X.X) instead of the local ones (10.10.X.X), and boom! The application started to fly. The problems and exceptions were gone; it ran perfectly, like the wind.

Our systems engineer dug deeper and deeper into hardware/wiring problems, talked many times with the datacenter, and tested again and again, but the final verdict was: "your hardware and wiring are OK; check your network configuration and your app."

In a meeting, the systems engineer told me that "the IPv6 configuration on the local network is unnecessary, so we can shut it down completely." I don't know why. I didn't ask any more questions after that dialogue.

A few days later our company hired another systems engineer, who hates IPv6 as well, and I was very surprised. My first question is: why do both systems engineers hate IPv6? What is the problem with IPv6?
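
I haven't touched anything myself yet, but as far as I understand, the per-interface state of IPv6 can at least be inspected with standard tools before deciding to shut it down (interface names as in my ifconfig output below; the sysctl key is the stock Linux/Ubuntu one):

root@10.10.10.4 ~ # sysctl net.ipv6.conf.eth1.disable_ipv6   # 1 would mean IPv6 is already off on eth1
root@10.10.10.4 ~ # ip -6 addr show dev eth1                 # which IPv6 addresses eth1 actually carries
root@10.10.10.4 ~ # ip -6 route                              # whether any IPv6 routes cover the private net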

The main problem is that our application is now talking to memcached and MySQL over the external IP addresses, while we want to use the local network for that. It works perfectly over the external IPs but not over the local ones.
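
To quantify the difference between the two paths instead of just feeling it, my idea is to time one trivial round-trip to each service over both of them. A rough sketch (10.10.10.1 as srv1's local address, 144.XX.XX.XX as its external one, and appuser are placeholders for our real values):

# one memcached round-trip, local path vs. external path
root@10.10.10.4 ~ # time printf 'version\r\nquit\r\n' | nc 10.10.10.1 11211
root@10.10.10.4 ~ # time printf 'version\r\nquit\r\n' | nc 144.XX.XX.XX 11211

# same idea for MySQL: time a trivial query over both addresses
root@10.10.10.4 ~ # time mysql -h 10.10.10.1 -u appuser -p -e 'SELECT 1'
root@10.10.10.4 ~ # time mysql -h 144.XX.XX.XX -u appuser -p -e 'SELECT 1'

If the local-path timings are much worse or erratic, that would confirm the network rather than the application.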

I don't know where the problem is. I'm not a systems or network engineer, and I don't know what they did on the systems side, but I believe there is a misconfiguration issue. Both systems engineers deny that anything is wrong, but I want to dig into this further.

Where can I start? What are the proper tools for finding the problem? Are these outputs normal:

root@10.10.10.4 ~ # ping6 google.com
PING google.com(fra02s20-in-x04.1e100.net) 56 data bytes
64 bytes from fra02s20-in-x04.1e100.net: icmp_seq=1 ttl=56 time=5.46 ms
64 bytes from fra02s20-in-x04.1e100.net: icmp_seq=2 ttl=56 time=5.43 ms
^C
--- google.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 5.432/5.447/5.462/0.015 ms
root@10.10.10.4 ~ # ping6 10.10.10.3
unknown host
root@10.10.10.4 ~ # ping6 10.10.10.1
unknown host
root@10.10.10.4 ~ # ifconfig
eth0      Link encap:Ethernet  HWaddr d4:3d:7e:ec:f0:11  
          inet addr:144.XX.XX.XX  Bcast:144.XX.XX.XX  Mask:255.255.255.224
          inet6 addr: fe80::d63e:7efe:fedf:f011/64 Scope:Link
          inet6 addr: 2c01:4e8:200:7343::2/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3523880 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7026713 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1042946956 (1.0 GB)  TX bytes:9140153208 (9.1 GB)

eth0:1    Link encap:Ethernet  HWaddr d4:3d:7e:ec:f0:11  
          inet addr:144.XX.XX.XXX  Bcast:144.XX.XX.XX  Mask:255.255.255.224
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1      Link encap:Ethernet  HWaddr 68:05:ca:06:68:a2  
          inet addr:10.10.10.4  Bcast:10.10.10.255  Mask:255.255.255.0
          inet6 addr: fe80::6c05:caff:fe26:57a2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:47434 errors:0 dropped:986 overruns:0 frame:0
          TX packets:364069 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:7188468 (7.1 MB)  TX bytes:527053731 (527.0 MB)
          Interrupt:16 Memory:f7cc0000-f7ce0000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:4765 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4765 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:540280 (540.2 KB)  TX bytes:540280 (540.2 KB)

Where should I go now to find out what the problem is?
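
One thing I noticed while re-reading the output above: ping6 only speaks IPv6, so pinging IPv4 literals like 10.10.10.3 will always print "unknown host"; those two lines probably prove nothing by themselves. What worries me more is the dropped:986 RX counter on eth1. Here is the checklist of commands I'm considering running against the private interface; treating 10.10.10.1 as the other box's address is an assumption from my earlier tests, and mtr/iperf would need to be installed on both boxes:

# negotiated speed/duplex and NIC-level drop counters on the private NIC
root@10.10.10.4 ~ # ethtool eth1
root@10.10.10.4 ~ # ethtool -S eth1 | egrep -i 'drop|err|discard'

# per-hop loss/latency, then raw TCP throughput across the private net
root@10.10.10.4 ~ # mtr --report --report-cycles 100 10.10.10.1
root@10.10.10.1 ~ # iperf -s                      # server side, on the other box
root@10.10.10.4 ~ # iperf -c 10.10.10.1 -t 30     # client side, from this box

# watch the actual memcached/mysql traffic for retransmissions and resets
root@10.10.10.4 ~ # tcpdump -ni eth1 'port 11211 or port 3306'

If the NIC counters stay clean while mtr or iperf shows loss, that would point at the switch or provider side rather than at the boxes themselves.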

EDIT: I think these outputs are also interesting:

root@10.10.10.4 # netstat -s | egrep -i 'loss|retrans|drop'
    1588 segments retransmited
    63 times recovered from packet loss by selective acknowledgements
    TCPLostRetransmit: 4
    9 timeouts in loss state
    375 fast retransmits
    46 forward retransmits
    519 retransmits in slow start
    1 SACK retransmits failed


root@10.10.10.1 # netstat -s | egrep -i 'loss|retrans|drop'
    32 dropped because of missing route
    2290 segments retransmited
    2 SYNs to LISTEN sockets dropped
    150 times recovered from packet loss by selective acknowledgements
    TCPLostRetransmit: 5
    4 timeouts in loss state
    410 fast retransmits
    85 forward retransmits
    150 retransmits in slow start
    12 SACK retransmits failed

Are these outputs really normal?
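
Since the netstat -s counters are cumulative since boot, I guess the absolute numbers don't mean much on their own. My plan is to sample them twice while the indexer is running and diff the results, roughly like this, to see whether the retransmits are still climbing on the local path:

root@10.10.10.4 ~ # netstat -s | egrep -i 'loss|retrans|drop' > /tmp/before
root@10.10.10.4 ~ # sleep 60
root@10.10.10.4 ~ # netstat -s | egrep -i 'loss|retrans|drop' > /tmp/after
root@10.10.10.4 ~ # diff /tmp/before /tmp/after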

  • What DNS is involved in all of this? Have the hosts files been set up so that each server knows each other by name and IP? – Vasili Syrakis Dec 11 '13 at 04:55
  • No, the hosts file on every box includes only two lines: its own loopback entry and its external IP address with the hostname, i.e. 127.0.0.1 localhost and 144.xx.xxx hostname.of.box – edigu Dec 11 '13 at 05:09
  • In summary: Using the external interfaces and the communication is fast; using the internal interfaces and the communication is slow. Does that basically sum it up? – Mark Henderson Dec 11 '13 at 05:10
  • @mark yes, exactly. – edigu Dec 11 '13 at 05:11
  • I just added another output to the question. – edigu Dec 11 '13 at 05:17
  • It sounds very much to me like the interfaces are assigned backwards. Your current internal interface *sounds* like it's running through some sort of router or something enforcing QoS or rate limiting. But if they're both on the same local network then there should be no routing, so perhaps it's a switch-based setting on the provider's end. It might also be that only one host is misconfigured, not both. – Mark Henderson Dec 11 '13 at 05:42
  • Yes, they're in the same network (also in the same rack); the hosting company confirmed that. – edigu Dec 11 '13 at 05:48
  • Are the "local" network interfaces dropping traffic in hardware? Check with `ethtool -S ethX` and look for anything that looks like a drop or discard or error counter. – suprjami Dec 21 '13 at 23:42
