
I have two servers (let me name them A and B).

Facts:

  • They have the same CPU, memory, motherboard, hard drive, and uplink speed.
  • They both run Ubuntu 12.04 with Python 2.7.3 and the latest revision of Django.
  • They are located in the same data center, with the same name server setup.
  • They have similar ping and traceroute results to the name servers.

Server A works fine. My problem is that Server B is very slow when Python connects to the internet.

Below are the tests I ran on both servers (domain_list_1 and domain_list_2 are two lists, each containing 100 unique domains):

Test One:

import socket
import time

starttime = time.time()
for domain in domain_list_1:
    ip = socket.gethostbyname(domain)
print '%.1f items per second' % (100/(time.time()-starttime))
>> Server A Results: 3.3 items per second
>> Server B Results: 0.7 items per second

Test Two:

import os
import time

starttime = time.time()
for domain in domain_list_2:
    os.system('nslookup %s > /dev/null' % domain)
print '%.1f items per second' % (100/(time.time()-starttime))
>> Server A Results: 3.3 items per second
>> Server B Results: 3.3 items per second

As Test Two shows, the network on Server B itself is fine.

I ran similar tests with urllib2 and the results were the same: Server A is fine, but on Server B urllib2 is slower than wget or curl doing the same job. So I believe it's a Python problem; I just don't know what went wrong with the Python setup on Server B.

Is there a way to profile the underlying calls and find out which part is slowing the whole process down?
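One low-tech starting point, before reaching for a profiler, is to time each lookup individually and see whether every domain is slow or only a few. The sketch below is an illustration only: `time_lookups` and its `resolve` parameter are hypothetical names, not part of the tests above, and the code is written to run on Python 2.7 as well as Python 3.

```python
import socket
import time


def time_lookups(domains, resolve=socket.gethostbyname):
    """Return (domain, seconds, error) tuples, slowest lookup first."""
    results = []
    for domain in domains:
        start = time.time()
        error = None
        try:
            resolve(domain)
        except socket.error as exc:  # also covers socket.gaierror / herror
            error = str(exc)
        results.append((domain, time.time() - start, error))
    # Sort so the lookups that dominate the total runtime come first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Example use with the list from Test One:
#   for domain, seconds, error in time_lookups(domain_list_1)[:10]:
#       print("%-30s %.3fs %s" % (domain, seconds, error or ""))
```

This only says how long each call took, not why; for the "why", a system call trace (as suggested in the comments) is the more direct tool.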

Thank you in advance!

jack
  • "connect to the internet" is very different from what you're doing, which is "looking up DNS entries by name". Your name resolution system is probably misconfigured on server B. This is unlikely to be a Python problem, since Python is just calling the OS. – Greg Hewgill Nov 09 '12 at 02:35
  • @Greg, if name resolution has a problem at the OS level, how do you explain nslookup being much faster than Python's socket.gethostbyname()? – jack Nov 09 '12 at 03:15
  • nslookup bypasses your local resolver library, since it is specific to DNS (remember, names can be resolved in more than one way, just one of which is DNS). Check `/etc/nsswitch.conf` for how hostnames are looked up. – Greg Hewgill Nov 09 '12 at 03:21
  • @Greg, /etc/nsswitch.conf on A & B are exactly the same. – jack Nov 09 '12 at 03:27
  • Well what does the line for `hosts:` say? The services listed on that line are what the resolver library checks for hostname lookup. One of the entries is probably `dns`, but there might be others (that could be causing your problem). You can also run your program under `strace` and see where the extra delay is (but it might be hard to track down that way). – Greg Hewgill Nov 09 '12 at 03:27
  • @Greg, both servers show "hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4" in /etc/nsswitch.conf. BTW, how do I find the useful information in the strace output? Both tests generate a few MB of text output. – jack Nov 09 '12 at 03:54
  • I would get rid of that mdns4 stuff unless you really know you need it (if these are servers, probably not). See this for somebody with a similar problem, solved by using just `hosts: files dns`: https://bugs.launchpad.net/ubuntu/+source/nss-mdns/+bug/94940 – Greg Hewgill Nov 09 '12 at 04:07
  • @Greg, I removed mdns4 but the performance did not improve. – jack Nov 09 '12 at 04:21
  • Did you remove *everything* related to mdns4, so the line reads just `hosts: files dns` without the `mdns4_minimal` and the `[NOTFOUND=return]` stuff? – Greg Hewgill Nov 09 '12 at 04:23
  • @Greg, yes, the line is now "hosts: files dns" on both servers. – jack Nov 09 '12 at 04:39
  • I'm afraid I'm out of ideas. You might try asking this over on http://askubuntu.com or http://serverfault.com as it does not appear to be related to programming (it's almost certainly a configuration problem of some kind and might even be specific to Ubuntu). Good luck! – Greg Hewgill Nov 09 '12 at 04:42
  • @Greg, I found a solution after comparing the strace output. Thank you for your advice. – jack Nov 09 '12 at 06:39
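The `hosts:` check discussed in the comments can also be scripted, which makes it easier to compare two machines. This is a minimal sketch; `hosts_sources` is a hypothetical helper name, and it assumes the usual one-line `hosts:` entry in `/etc/nsswitch.conf`.

```python
def hosts_sources(nsswitch_text):
    """Return the lookup sources on the 'hosts:' line, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            return line[len("hosts:"):].split()
    return []  # no hosts: line found


# Example use on a server:
#   with open("/etc/nsswitch.conf") as handle:
#       print(hosts_sources(handle.read()))
```

Running it on both servers and comparing the two lists shows whether the resolver is configured to consult anything besides `files` and `dns`.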

1 Answer


Based on the suggestions given by Greg, I looked into the strace output and found the following:

Server A:

12879 21:29:24.182590 connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("206.251.73.9")}, 16) = 0 <0.000035>
12879 21:29:24.182694 poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}]) <0.000018>
12879 21:29:24.182778 sendto(5, "'!\1\0\0\1\0\0\0\0\0\0\njanadrakka\3com\0\0\1\0\1", 32, MSG_NOSIGNAL, NULL, 0) = 32 <0.000040>
12879 21:29:24.182881 poll([{fd=5, events=POLLIN}], 1, 5000) = 1 ([{fd=5, revents=POLLIN}]) <0.067000>
12879 21:29:24.249987 ioctl(5, FIONREAD, [130]) = 0 <0.000022>
12879 21:29:24.250100 recvfrom(5, "'!\201\200\0\1\0\1\0\2\0\2\njanadrakka\3com\0\0\1\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("206.251.73.9")}, [16]) = 130 <0.000032>
12879 21:29:24.250287 close(5)          = 0 <0.000053>

Server B:

4850  21:28:55.501276 connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("206.251.73.9")}, 16) = 0 <0.000019>
4850  21:28:55.501348 poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}]) <0.000014>
4850  21:28:55.501419 sendto(5, "\346\10\1\0\0\1\0\0\0\0\0\0\fdeghatgostar\3com\0\0\1"..., 34, MSG_NOSIGNAL, NULL, 0) = 34 <0.000036>
4850  21:28:55.501506 poll([{fd=5, events=POLLIN}], 1, 5000) = 1 ([{fd=5, revents=POLLIN}]) <0.615731>
4850  21:28:56.117335 ioctl(5, FIONREAD, [129]) = 0 <0.000033>
4850  21:28:56.117429 recvfrom(5, "\346\10\201\200\0\1\0\1\0\2\0\2\fdeghatgostar\3com\0\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("206.251.73.9")}, [16]) = 129 <0.000011>
4850  21:28:56.117499 close(5)          = 0 <0.000009>

The delay occurred in this system call, where the process waits for the DNS reply (about 0.07 s on Server A versus about 0.62 s on Server B):

A: 12879 21:29:24.182881 poll([{fd=5, events=POLLIN}], 1, 5000) = 1 ([{fd=5, revents=POLLIN}]) <0.067000>

B: 4850 21:28:55.501506 poll([{fd=5, events=POLLIN}], 1, 5000) = 1 ([{fd=5, revents=POLLIN}]) <0.615731>
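With a few MB of trace output, picking out the slow calls by eye is tedious. The sketch below sorts trace lines by the `<seconds>` value at the end of each line. It assumes the trace was captured with something like `strace -f -tt -T -o trace.txt python test.py`; that command is a guess based on the format shown above, and `slowest_calls` and the file name are hypothetical.

```python
import re

# Matches the trailing "<seconds>" that strace -T appends to each call.
DURATION = re.compile(r"<(\d+\.\d+)>\s*$")


def slowest_calls(lines, limit=10):
    """Return (seconds, line) pairs for the slowest system calls."""
    timed = []
    for line in lines:
        match = DURATION.search(line)
        if match:  # skip lines without a duration (signals, exits, ...)
            timed.append((float(match.group(1)), line.rstrip()))
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return timed[:limit]


# Example use on a saved trace:
#   with open("trace.txt") as handle:
#       for seconds, line in slowest_calls(handle, 5):
#           print("%.6f  %s" % (seconds, line))
```

Comparing the top few entries from each server points straight at the call where the two diverge, here the `poll` on the DNS socket.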

Solutions:

The delay seems to have been caused by IPv6 DNS lookups on Server B. I am still not sure why Server A does not have the same problem, but the following changes on Server B solved it.

Add the following lines to /etc/sysctl.conf and reboot the server.

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
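After the reboot, it may be worth confirming that the setting actually took effect. On Linux the current value is exposed under `/proc/sys`; the sketch below reads it. `ipv6_disabled` is a hypothetical helper name, and the default path corresponds to the `net.ipv6.conf.all.disable_ipv6` key above.

```python
def ipv6_disabled(path="/proc/sys/net/ipv6/conf/all/disable_ipv6"):
    """Return True/False for the sysctl value, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip() == "1"
    except IOError:
        # For example, the file is absent when the kernel has no IPv6 support.
        return None


# Example use:
#   print(ipv6_disabled())
```

A result of `True` means the kernel reports IPv6 as disabled for all interfaces; `None` means the value could not be read at all.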

Finally, thanks to Greg for the advice.

jack