I wrote a very simple spider program to fetch webpages from a single site. Here is a minimized version:
```python
from twisted.internet import epollreactor
epollreactor.install()

from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500

# A persistent connection pool, shared by all requests through the agent.
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)

def onHeader(response, i):
    # Headers have arrived; read the body and report when it is done.
    deferred = readBody(response)
    deferred.addCallback(onBody, i)
    deferred.addErrback(errorHandler)
    return response

def onBody(body, i):
    print('Received %s, Length %s' % (i, len(body)))

def errorHandler(err):
    print('%s : %s' % (reactor.seconds() - startTimeStamp, err))

def requestFactory():
    for i in range(start, end):
        deferred = agent.request('GET', baseUrl + str(i))
        deferred.addCallback(onHeader, i)
        deferred.addErrback(errorHandler)
        print('Generated %s' % i)
        # Spin the reactor once so connections are created incrementally.
        reactor.iterate(1)
    print('All requests generated, elapsed %s' % (reactor.seconds() - startTimeStamp))

startTimeStamp = reactor.seconds()
reactor.callWhenRunning(requestFactory)
reactor.run()
```
For a few requests, say 100, it works fine. But with a massive number of requests, it fails.
I expected all of the requests (around 2,500) to be automatically pooled, scheduled, and pipelined, since I use `HTTPConnectionPool`, set `maxPersistentPerHost`, create an `Agent` instance with it, and create the connections incrementally.
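To spell out what I believe that setup means (a sketch of my understanding; as far as I know, `persistent` defaults to `True` on `HTTPConnectionPool`):

```python
# Equivalent, as I understand it, to the setup in the program above:
# a persistent pool that keeps up to 10 open connections per host for reuse.
pool = HTTPConnectionPool(reactor, persistent=True)
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)
```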
But that's not what happens: the connections are neither kept alive nor pooled. The program does establish the connections incrementally, but they are not pooled; each connection closes after its body is received, and later requests never wait in the pool for an available connection. So it ends up using thousands of sockets and finally fails due to timeouts, because the remote server has a 30s connection timeout and thousands of requests can't be completed within 30s.
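While debugging, I watched what the server sends back using a variant of `onHeader` like this (a sketch, not part of the minimized program; `response.version` and the raw `Connection` header are the relevant bits):

```python
def onHeader(response, i):
    # response.version is a ('HTTP', major, minor) tuple; the Connection
    # response header, if present, shows whether keep-alive was granted.
    print('HTTP version: %s/%d.%d' % response.version)
    print('Connection header: %s' % response.headers.getRawHeaders('Connection'))
    deferred = readBody(response)
    deferred.addCallback(onBody, i)
    deferred.addErrback(errorHandler)
    return response
```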
Could you please give me some help with this? I have tried my best; here are my findings.
- The error occurs exactly 30s after the reactor starts running, and isn't influenced by anything else.
- When I point the spider at my own server, I find some interesting things:
  - The HTTP protocol version is 1.1 (I checked the Twisted documentation; the default `HTTPClient` is 1.0 rather than 1.1).
  - If I don't add any explicit headers (just like in the minimized version), the request headers don't contain `Connection: Keep-Alive`, and neither do the response headers.
  - If I add an explicit header to ensure a keep-alive connection (see the sketch after this list), the request headers do contain `Connection: Keep-Alive`, but the response headers still don't. (I am sure my server behaves correctly; other clients such as Chrome and wget do receive the `Connection: Keep-Alive` header.)
- I checked `/proc/net/sockstat` while the program was running: the socket count increases rapidly at first and decreases rapidly later. (I have raised the ulimit to support plenty of sockets.)
- I wrote a similar program with treq (a Twisted-based request library). The code is almost the same, so I won't paste it here. Link: https://gist.github.com/Preffer/dad9b1228fcd75cebd75
  - Its behavior is almost the same: no pooling, even though pooling is expected according to treq's feature list.
  - If I add an explicit header to it, `Connection: Keep-Alive` never appears in the response headers.
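For reference, here is roughly how I added the explicit header in the variant mentioned above (a sketch; only the request call changes from the minimized program):

```python
from twisted.web.http_headers import Headers

# Same request as before, but explicitly asking for a keep-alive connection.
deferred = agent.request(
    'GET',
    baseUrl + str(i),
    Headers({'Connection': ['Keep-Alive']}),
)
```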
Based on all of the above, I strongly suspect that the quirky `Connection: Keep-Alive` header is ruining the program. But this header is part of the HTTP/1.1 standard, and the client does report HTTP/1.1. I am completely puzzled by this.