I have an event-oriented server which already uses select.epoll().

Now a new requirement needs to be solved: URLs should get fetched (asynchronously).

Up to now I have always used the requests library, and I have always used it synchronously, never asynchronously.

How can I use the requests library (or a different urllib) combined with linux epoll?

The requests library docs have a note about this, but only async frameworks are mentioned there (not select.epoll()): http://docs.python-requests.org/en/master/user/advanced/#blocking-or-non-blocking

I am not married to select.epoll(). It has worked up to now. I can use a different solution, if feasible.

Background: The bigger question is "Should I use select.epoll() or one of the many async frameworks which Python has?". But questions at StackOverflow must not be too broad. That's why this question focuses on "Retrieve several URLs via select.epoll()". If you have hints on the bigger question, please leave a comment.

If you are curious, this question is needed for a small project which I develop in my spare time: https://github.com/guettli/ipo (IPO is an open source asynchronous job queue which is based on PostgreSQL.)

guettli
  • you have to show how your event loop works. – georgexsh Jan 04 '18 at 14:27
  • @georgexsh here is how my event loop works: https://github.com/guettli/ipo/blob/master/ipo/management/commands/ipo_server.py#L17 – guettli Jan 04 '18 at 15:16
  • To your bigger question: the principle of IO is polling for high-speed IO and interrupts for slow IO. – obgnaw Jan 05 '18 at 10:12
  • @obgnaw you say "the principle of IO is polling for high-speed IO and interrupts for slow IO". I would like to optimize later. How can I know whether the connection to a URL is slow or high-speed? In my case the URLs will be from servers which are very close to the daemon. Thank you for the hint. I guess I will start with epoll() first. What do you think? – guettli Jan 05 '18 at 10:59
  • @guettli what python versions does your project need to support? – Oleg Kuralenko Jan 07 '18 at 20:35
  • @ffeast it must support current Python3, and Python2.7 support would be nice. – guettli Jan 08 '18 at 10:00
  • Redis [is a persistent storage](https://redis.io/topics/persistence). So the need for the package may be based on a wrong argument (*Why reinvent and not reuse?* in your README.rst). – saaj Jan 11 '18 at 12:26
  • @saaj thank you for your feedback. I updated the README: https://github.com/guettli/ipo/ – guettli Jan 11 '18 at 14:39
  • Also note that [database-as-IPC](https://en.wikipedia.org/wiki/Database-as-IPC), and [RDBMS-as-a-queue](https://www.engineyard.com/blog/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you) in particular, is a recognised anti-pattern. You'll likely have issues with performance due to polling and locking. On Redis you can build a reliable queue. You don't need full ACID for it. Redis is single-threaded, has limited transaction support and also Lua scripts' execution is atomic. E.g. I wrote a reliable queue package [Torrelque](https://pypi.python.org/pypi/Torrelque) with these tools. – saaj Jan 12 '18 at 10:22
  • @saaj yes, I know that this is an anti-pattern ... up to now. You are talking about polling and locking. I don't see this in my case (PostgreSQL LISTEN/NOTIFY). But maybe I am blind. You say "You don't need full ACID for it". I want it all: ACID, zero costs and a lot of fun. And up to now I do not know why I should not. I am sure that performance won't be a problem. Thank you for the hint to Torrelque; up to now I only knew python-rq. – guettli Jan 12 '18 at 12:11
  • Oh, I didn't know of the LISTEN/NOTIFY SQL extension. That seems to get rid of the polling overhead. And maybe alleviates read/write contention, depending on the load. Anyway, if it works for you and load tests show you have enough room to grow, don't bother with general precautions. Just in case, `aiopg` [supports notifications](http://aiopg.readthedocs.io/en/stable/core.html#server-side-notifications), which should give much better maintainability than the select-based interaction that `psycopg2` suggests. – saaj Jan 12 '18 at 13:04
  • @saaj aiopg looks good. Unfortunately I still need to support Python2.7 for some months. At the moment I use threads. But this could get improved and refactored later without any outside noticeable change. – guettli Jan 12 '18 at 13:28

3 Answers

How can I use the requests library (or a different urllib) combined with linux epoll?

Unfortunately you can't, unless such a library has been built with this integration in mind. epoll, like select/poll/kqueue and others, is an I/O multiplexing system call, and the overall program architecture needs to be built around it.

Simply put, a typical program structure boils down to the following:

  • one needs to have a bunch of file descriptors (sockets in non-blocking mode in your case)
  • a system call (man epoll_wait in case of epoll) blocks until a specified event occurs on one or multiple descriptors
  • information about the descriptors that are ready for I/O is returned

After that, it is the outer code's job to handle these descriptors, i.e. figure out how much data has become available, call some callbacks, etc.
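
For illustration, here is a minimal sketch of that structure using select.epoll() with a single non-blocking socket (the host and the hand-rolled HTTP/1.0 request are just placeholders, not anything from the requests library):

import select
import socket

# connect (blocking), then switch the socket to non-blocking mode
sock = socket.create_connection(('example.com', 80))
sock.setblocking(False)

epoll = select.epoll()
epoll.register(sock.fileno(), select.EPOLLOUT)

request = b'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n'
response = b''
done = False
while not done:
    # epoll.poll() blocks until a registered descriptor is ready
    for fileno, event in epoll.poll():
        if event & select.EPOLLOUT:
            sent = sock.send(request)
            request = request[sent:]
            if not request:  # request fully written, now wait for the reply
                epoll.modify(fileno, select.EPOLLIN)
        elif event & select.EPOLLIN:
            chunk = sock.recv(4096)
            if chunk:
                response += chunk
            else:  # server closed the connection: response is complete
                epoll.unregister(fileno)
                done = True

epoll.close()
sock.close()
print(response[:200])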

If the library uses regular blocking sockets, the only way to parallelize it is to use threads/processes. Here's a good article on the subject; the examples use C, which is good, as it's easier to understand what's actually happening under the hood.

Async frameworks & requests library

Let's check out what's suggested here:

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Some excellent examples are requests-threads, grequests, and requests-futures.

requests-threads - uses threads

grequests - integration with gevent (it’s a different story, see below)

requests-futures - in fact also threads/processes

None of them has anything to do with true asynchronicity.
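
To make the thread-based nature concrete, here is a minimal sketch with requests-futures (the URL is a placeholder); a FuturesSession simply dispatches ordinary blocking requests calls to a concurrent.futures thread pool:

from requests_futures.sessions import FuturesSession

# each get() runs a plain blocking requests call on a worker thread
session = FuturesSession(max_workers=10)
futures = [session.get('http://example.com/?q={}'.format(i)) for i in range(10)]
for future in futures:
    response = future.result()  # blocks until that request has finished
    print(response.status_code)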

Should I use select.epoll() or one of the many async frameworks which python has

Please note, epoll is a Linux-specific beast; it won't work, e.g., on OS X, which has a different mechanism called kqueue. As you appear to be writing a general-purpose job queue, it doesn't seem to be a good solution.

Now back to python. You’ve got the following options:

threads/processes/concurrent.futures - unlikely to be what you're aiming at, as your app is a typical C10K server

epoll/kqueue - you'll have to do everything yourself. In the case of fetching HTTP URLs you'll need to deal not only with http/ssl but also with asynchronous DNS resolution. Also consider asyncore, which provides some basic infrastructure

twisted/tornado - callback-based frameworks that already do all the low-level stuff for you

gevent - this is something you might like if you're going to reuse existing blocking libraries (urllib, requests, etc.) and support both Python 2.x and 3.x. But this solution is a hack by design. For an app of your size it might be OK, but I wouldn't use it for anything bigger that should be rock-solid and run in prod
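
For reference, the gevent approach boils down to monkey-patching the standard library so that blocking libraries such as requests become cooperative. A minimal sketch (the URLs are placeholders):

from gevent import monkey
monkey.patch_all()  # replace blocking stdlib primitives with cooperative ones

import gevent
import requests

def fetch(url):
    # looks blocking, but now yields to other greenlets while waiting on IO
    return url, requests.get(url).status_code

jobs = [gevent.spawn(fetch, url)
        for url in ('http://example.com', 'http://example.org')]
gevent.joinall(jobs)
print([job.value for job in jobs])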

asyncio

This module provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives

It has everything you might need. There's also a bunch of libraries for working with popular RDBMSs and HTTP: https://github.com/aio-libs

But it lacks support for Python 2.x. There are ports of asyncio to Python 2.x, but I'm not sure how stable they are.
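
For a taste, here is a minimal sketch (the URLs are placeholders) that fetches several URLs concurrently with asyncio and aiohttp, one of the aio-libs projects:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return url, await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # all fetches run concurrently on a single thread
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ['http://example.com', 'http://example.org']
results = asyncio.get_event_loop().run_until_complete(main(urls))
for url, body in results:
    print(url, len(body))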

Finally

So, if I could sacrifice Python 2.x, I'd personally go with asyncio & related libraries.

If you really, really need Python 2.x, use one of the approaches above, depending on the stability required and the assumed peak load.

Oleg Kuralenko
  • thank you for your in depth answer. My current feeling is to go with asyncio and drop Python2 support. – guettli Jan 08 '18 at 14:38
  • I am not married to select.epoll(). It worked up to now. I can use a different solution, if feasible. – guettli Jan 08 '18 at 14:41

When doing high-performance development, we always choose our weapons based on the situation. So it is still too broad to answer.

But your bigger question is an easier one: only IO-bound programs are suited for async.

What is the purpose of epoll and asynchronous IO? To avoid the CPU waiting on IO and doing nothing. The CPU waits because IO blocks, and IO blocks because there is no data to read or no space to write.

Buffers are introduced to reduce system calls. When you call read on a stream, you actually read from the buffer. (Conceptually; this is not very accurate.)

Non-blocking IO without select or epoll means busy polling (epoll itself is implemented with interrupts underneath). It is essentially something like this:

while true {
    for stream in streams {
        if stream has data {
            read until unavailable
        }
    }
}

That's silly, so there are select and epoll. If every time you read from the buffer there is data waiting for you, it's high-speed IO, and epoll/select is your best choice. When the buffer is almost always empty, it's a slow stream, IO-bound, and async suits this situation very well.

I don't know async very well; to me it's just soft interrupts internally and a lot of callbacks.

obgnaw

The main point above is correct: you cannot technically do this with a blocking call meant for multiplexed I/O such as select(), epoll(), or their BSD/macOS and Windows variants. These calls allow a timeout specification, so you can come close by repeatedly polling on short intervals and then passing work to an async handler off the main thread. In that case the reading is done on the main thread: multiple reads can signal that they're ready, and the main thread is primarily devoted to that task.
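
As an illustration of that pattern, here is a minimal sketch (the host, the hand-rolled request, and handle() are hypothetical placeholders) that polls with a 50 ms timeout on the main thread and hands the finished read to a worker thread:

import select
import socket
from concurrent.futures import ThreadPoolExecutor

def handle(response):  # hypothetical handler, runs off the main thread
    print('handled {} bytes'.format(len(response)))

sock = socket.create_connection(('example.com', 80))
sock.sendall(b'GET / HTTP/1.0\r\nHost: example.com\r\n\r\n')
sock.setblocking(False)

epoll = select.epoll()
epoll.register(sock.fileno(), select.EPOLLIN)
pool = ThreadPoolExecutor(max_workers=4)

response, connected = b'', True
while connected:
    # wait at most 50 ms, then fall through to other main-loop work
    for fileno, event in epoll.poll(0.05):
        chunk = sock.recv(4096)
        if chunk:
            response += chunk
        else:  # server closed the connection
            epoll.unregister(fileno)
            connected = False
    # ... other main-loop work would go here ...

pool.submit(handle, response)  # hand the result off the main thread
pool.shutdown()
sock.close()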

If the scale of your problem (the number of read channels) is small to medium, then nothing is going to beat an epoll()...read() or even a select()...read(). So I'd encourage you to think about that: get as much work as possible off the main thread, which can then be devoted to the requests.

If you are looking for an async solution, one of your best options is the grequests library, both for ease of use and for performance. To get an idea, run the following client-server pair. Note that the use of tornado is incidental and only on the server side, whereas your concern is the client.

Try this - the performance difference is night and day.

A solution for you is the client.py script below; it uses grequests to issue get() requests asynchronously.

server.py

from tornado import httpserver, ioloop, web, gen
import time


class Check(web.RequestHandler):

    @gen.coroutine
    def get(self):
        try:
            data = int(self.get_argument('data'))
        except ValueError:
            raise web.HTTPError(400, reason='Invalid value for data')

        delay = 100  # milliseconds
        start = time.time()
        print('Processed: {!r}'.format(data))

        # simulate a slow response without blocking the IO loop
        yield gen.Task(ioloop.IOLoop.instance().add_timeout, start + delay / 1000.)

        self.write('.')
        self.finish()


if __name__ == '__main__':
    port = 4545

    application = web.Application([
        (r'/get', Check)
        ])

    http_server = httpserver.HTTPServer(application)
    http_server.listen(port)
    print('Listening on port: {}'.format(port))
    ioloop.IOLoop.instance().start()

client.py

import grequests
from tornado.httpclient import HTTPClient
import time

def call_serial(num, httpclient):
    url = 'http://127.0.0.1:4545/get?data={}'.format(num)
    response = httpclient.fetch(url)
    print('Added: {!r}'.format(num))

def call_async(mapper):
    # build the requests lazily, then let grequests run them all concurrently
    futures = (grequests.get(url) for url, _ in mapper)
    responses = grequests.map(futures)
    for response, (url, num) in zip(responses, mapper):
        print('Added: {!r}'.format(num))

def check(num):
    return num % 2 != 0

def serial_calls(httpclient, up_to):
    for num in range(up_to):
        if check(num):
            call_serial(num, httpclient)

def async_calls(up_to):
    mapper = []

    for num in range(up_to):
        if check(num):
            url = 'http://127.0.0.1:4545/get?data={}'.format(num)
            mapper.append((url, num))

    call_async(mapper)


if __name__ == '__main__':

    httpclient = HTTPClient()

    print('SERIAL CALLS')
    start = time.time()
    serial_calls(httpclient, 100)
    print('serial: {:.2f}s'.format(time.time() - start))

    print('ASYNC CALLS')
    start = time.time()
    async_calls(100)
    print('async: {:.2f}s'.format(time.time() - start))

    httpclient.close()

This is a true async solution, or as close as one can get in CPython/python. No pollers used.

Charles Pehlivanian
  • Just to make sure - Tornado is on the server side, only necessary for the server of the server-client pair. The logic that issues the `get()` requests uses grequests which I've found to be a good if not the best choice. – Charles Pehlivanian Jan 11 '18 at 15:19
  • about grequests: "Note: You should probably use requests-threads or requests-futures instead." from https://github.com/kennethreitz/grequests. Why does the author suggest to use something else? – guettli Jan 12 '18 at 12:24
  • I did not notice that as I installed with `pip`. Let me think about that. – Charles Pehlivanian Jan 13 '18 at 01:18