
I wrote a very simple spider program to fetch webpages from a single site.

Here is a minimized version.

from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='

start = 1001
end = 3500

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)

def onHeader(response, i):
    deferred = readBody(response)
    deferred.addCallback(onBody, i)
    deferred.addErrback(errorHandler)
    return response

def onBody(body, i):
    print('Received %s, Length %s' % (i, len(body)))

def errorHandler(err):
    print('%s : %s' % (reactor.seconds() - startTimeStamp, err))

def requestFactory():
    for i in range(start, end):
        deferred = agent.request('GET', baseUrl + str(i))
        deferred.addCallback(onHeader, i)
        deferred.addErrback(errorHandler)
        print('Generated %s' % i)
        reactor.iterate(1)

    print('All requests have been generated, elapsed %s' % (reactor.seconds() - startTimeStamp))

startTimeStamp = reactor.seconds()
reactor.callWhenRunning(requestFactory)
reactor.run()

For a few requests, say 100, it works fine, but for a massive number of requests it fails.

I expected all of the requests (around 2500) to be automatically pooled, scheduled, and pipelined, since I use an HTTPConnectionPool, set maxPersistentPerHost, create an Agent instance with the pool, and create the connections incrementally.

But it doesn't: the connections are neither kept alive nor pooled.

In this program, the connections are established incrementally, but they are not pooled: each connection closes after the body is received, and later requests never wait in the pool for an available connection.

So the program consumes thousands of sockets and eventually fails with timeouts, because the remote server has a connection timeout of 30s and thousands of requests cannot be completed within 30s.

Could you please give me some help on this?

I have tried my best on this; here are my findings.

  • The error occurs exactly 30s after the reactor starts running, and is not influenced by anything else.
  • When I let the spider fetch my own server, I found something interesting (see the probe sketch after this list):
    1. The HTTP protocol version is 1.1 (I checked the Twisted documentation; the default HTTPClient is 1.0 rather than 1.1).
    2. If I don't add any explicit headers (just like in the minimized version), the request headers don't contain Connection: Keep-Alive, and neither do the response headers.
    3. If I add an explicit header to ensure a keep-alive connection, the request headers do contain Connection: Keep-Alive, but the response headers still don't. (I am sure my server behaves correctly; other clients like Chrome and wget do receive the Connection: Keep-Alive header.)
  • I checked /proc/net/sockstat while the program ran: the socket count increases rapidly at first and decreases rapidly later. (I have increased the ulimit to allow plenty of sockets.)
  • I wrote a similar program with treq (a Twisted-based request library). The code is almost the same, so it is not pasted here.
    • Link: https://gist.github.com/Preffer/dad9b1228fcd75cebd75
    • Its behavior is almost the same: no pooling, even though pooling is listed among treq's features.
    • If I add explicit headers there too, Connection: Keep-Alive never appears in the response headers.
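For reference, here is roughly the probe I used to check findings 1-3 (a minimal sketch; http://localhost:8080/ is a stand-in for my own server):

from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers

pool = HTTPConnectionPool(reactor)
agent = Agent(reactor, pool=pool)

def showHeaders(response):
    # response.version is a tuple such as ('HTTP', 1, 1)
    print('Version: %s' % (response.version,))
    print('Connection: %s' % response.headers.getRawHeaders('Connection'))
    reactor.stop()

def fail(err):
    print(err)
    reactor.stop()

deferred = agent.request('GET', 'http://localhost:8080/',
                         Headers({'Connection': ['Keep-Alive']}))
deferred.addCallback(showHeaders)
deferred.addErrback(fail)
reactor.run()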

Based on all of the above, I strongly suspect that the quirky Connection: Keep-Alive handling is ruining the program. But this header is part of the HTTP 1.1 standard, and the client does report itself as HTTP 1.1. I am completely puzzled by this.

Eugene
    "Poolling", "pooling", and "polling" are all different things (well, the first isn't a thing at all as far as I know). – Jean-Paul Calderone Aug 28 '14 at 15:35
  • I am sorry for my mistake; all of these should be `pooling`. Any tips for my problem? – Eugene Aug 29 '14 at 00:34
  • I found a similar question, and it solved my problem: http://stackoverflow.com/questions/2861858/queue-remote-calls-to-a-python-twisted-perspective-broker – Eugene Aug 29 '14 at 02:47
  • Whenever you have code snippets on stack overflow, please put them on stack overflow, not hosted on external sites. – Glyph Aug 29 '14 at 03:18
  • @Glyph I'm sorry, I have pasted the code now. – Eugene Aug 29 '14 at 14:23

2 Answers


I solved the problem myself, with help from IRC and another Stack Overflow question, Queue remote calls to a Python Twisted perspective broker?

In summary, the agent's behavior is very different from that of the agent in Node.js (I have some experience with Node.js). As described in the Node.js documentation:

agent.requests

An object which contains queues of requests that have not yet been assigned to sockets.

agent.maxSockets

By default set to 5. Determines how many concurrent sockets the agent can have open per origin. Origin is either a 'host:port' or 'host:port:localAddress' combination.

So, here is the difference.

  • Twisted:

    • There is no doubt that Agent can queue requests when it is constructed with an HTTPConnectionPool instance.
    • But if a new request is issued after the connections in the pool have run out, the agent still creates a new connection and performs the request, rather than putting the request in a queue.
    • In fact, this drops a connection from the pool and pushes the newly created connection into it, keeping the connection count equal to maxPersistentPerHost (see the sketch of the pool's knobs after this list).
  • Node.js:

    • By default, the agent queues requests in an implicit connection pool with a size of 5 connections.
    • If a new request is issued after the connections in the pool have run out, the agent queues it in the agent.requests variable to wait for an available connection.
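For reference, here are the knobs HTTPConnectionPool exposes (a short sketch; the defaults are as I understand them from the Twisted documentation). Note that none of them makes the agent queue requests:

from twisted.internet import reactor
from twisted.web.client import HTTPConnectionPool

pool = HTTPConnectionPool(reactor, persistent=True)  # persistent defaults to True
pool.maxPersistentPerHost = 2       # cached connections kept per host, default 2
pool.cachedConnectionTimeout = 240  # seconds an idle cached connection is kept, default 240
pool.retryAutomatically = True      # retry idempotent requests once on stale connections, default True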

The Twisted agent's behavior therefore looks as if it can queue requests, but in practice it doesn't.

Intuitively, once a connection pool is assigned to an agent, you would expect the agent to use only the connections in the pool and to wait for an available connection whenever the pool runs out. That is exactly how the Node.js agent behaves.

Personally, I think this is buggy behavior in Twisted, or at least it could be improved by providing an option to configure the agent's behavior.

Because of this, I have to use a DeferredSemaphore to schedule the requests manually.

I raised an issue on the treq project on GitHub and got a similar solution: https://github.com/dreid/treq/issues/71

Here is my solution.

#!/usr/bin/env python
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
from twisted.internet.defer import DeferredSemaphore

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='

start = 1001
end = 3500
count = end - start
concurrency = 10
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = concurrency
agent = Agent(reactor, pool=pool)
sem = DeferredSemaphore(concurrency)
done = 0

def onHeader(response, i):
    deferred = readBody(response)
    deferred.addCallback(onBody, i)
    deferred.addErrback(errorHandler, i)
    return deferred

def onBody(body, i):
    sem.release()
    global done, count
    done += 1
    print('Received %s, Length %s, Done %s' % (i, len(body), done))
    if done == count:
        print('All items fetched')
        reactor.stop()

def errorHandler(err, i):
    # release the slot even on failure, so waiting requests are not starved
    sem.release()
    print('[%s] id %s: %s' % (reactor.seconds() - startTimeStamp, i, err))

def requestFactory(token, i):
    deferred = agent.request('GET', baseUrl + str(i))
    deferred.addCallback(onHeader, i)
    deferred.addErrback(errorHandler, i)
    print('Request sent %s' % i)
    # this function is itself a callback fired by the reactor,
    # so there is no need to iterate manually
    #reactor.iterate(1)
    return deferred

def assign():
    for i in range(start, end):
        sem.acquire().addCallback(requestFactory, i)

startTimeStamp = reactor.seconds()
reactor.callWhenRunning(assign)
reactor.run()

Is this right? I would appreciate any corrections or improvements.
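Update: a possible simplification (a sketch only, not tested against the real site). DeferredSemaphore.run() acquires the semaphore, calls the function, and releases the slot when the returned Deferred fires, even on failure, so the manual release bookkeeping above could be dropped:

#!/usr/bin/env python
from __future__ import print_function
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.internet.defer import DeferredSemaphore, gatherResults
from twisted.web.client import Agent, HTTPConnectionPool, readBody

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
concurrency = 10

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = concurrency
agent = Agent(reactor, pool=pool)
sem = DeferredSemaphore(concurrency)

def fetch(i):
    # one request; the semaphore slot is held until this Deferred fires
    deferred = agent.request('GET', baseUrl + str(i))
    deferred.addCallback(readBody)
    deferred.addCallback(lambda body: print('Received %s, Length %s' % (i, len(body))))
    deferred.addErrback(lambda err: print('id %s: %s' % (i, err)))
    return deferred

def main():
    # sem.run() waits for a free slot, then releases it automatically
    deferreds = [sem.run(fetch, i) for i in range(1001, 3500)]
    gatherResults(deferreds).addBoth(lambda ignored: reactor.stop())

reactor.callWhenRunning(main)
reactor.run()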

Eugene

For a few requests, say 100, it works fine, but for a massive number of requests it fails.

This is either protection against web crawlers or server protection against DoS/DDoS: you are sending too many requests from the same IP in a short time, so the firewall or the WSA will block your future requests. Just modify your script to send the requests in batches spaced out over time. You can use reactor.callLater() to add a delay after every X requests.
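A sketch of that idea (runInBatches, makeRequest, batchSize, and delay are placeholder names, not from the question's code):

from twisted.internet import reactor

def runInBatches(ids, makeRequest, batchSize=100, delay=5.0):
    # issue one batch now, schedule the rest after `delay` seconds
    batch, rest = ids[:batchSize], ids[batchSize:]
    for i in batch:
        makeRequest(i)
    if rest:
        reactor.callLater(delay, runInBatches, rest, makeRequest, batchSize, delay)

# e.g. runInBatches(list(range(1001, 3500)), someRequestFunction)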

e-nouri
  • Of course that is possible, but I later found it isn't the cause, because the Node.js version works fine. Would you please take a look at the answer I posted? – Eugene Aug 29 '14 at 15:35
  • The problem seems simpler to me than that; test your code and let me know if it works. Also, I think adding a delay after every 100 requests is simpler than rebuilding the crawler. Good luck mate! – e-nouri Aug 29 '14 at 16:05
  • Of course it works. Now I realize I had to bind the `DeferredSemaphore` to the `Agent` to make the agent behave as expected. – Eugene Aug 30 '14 at 03:36