I want to "ping" a server, check the header response to see if the link is broken, and if it's not broken, actually download the response body.

Traditionally, using a sync method with the requests module, you could send a get request with the stream = True parameter, and capture the headers before the response body download, deciding, in case of error (not found, for example), to abort the connection.
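
For reference, the synchronous approach described above might look roughly like the sketch below. It is only an illustration: `fetch_if_ok` is a made-up helper name, and treating every status of 400 or above as "broken" is an assumption about what counts as a broken link.

```python
import requests


def fetch_if_ok(url, timeout=10):
    """Return (status_code, body); body is None when the status is an error."""
    # stream=True makes requests return as soon as the headers are parsed;
    # the body is not read until .content is accessed.
    response = requests.get(url, stream=True, timeout=timeout)
    try:
        if not response.ok:  # .ok is False for status codes of 400 and above
            return response.status_code, None
        return response.status_code, response.content
    finally:
        # Closing without reading the body abandons the rest of the download.
        response.close()
```

Calling `fetch_if_ok('http://www.google.com')` would then return the status code plus either the body bytes or `None`.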

My problem is that doing this with the async libraries grequests or requests-futures is beyond my limited knowledge.

I've tried setting the stream parameter to True in requests-futures, but to no avail: it still downloads the response body without letting me intervene as soon as it gets the response headers. And even if it did, I wouldn't be sure how to proceed.

This is what I've tried:

test.py

from requests_futures.sessions import FuturesSession

session = FuturesSession()
session.stream = True

future = session.get('http://www.google.com')
response = future.result()
print(response.status_code) # Here I would assume the response body hasn't been loaded

Upon debugging I find it downloads the response body either way.

I would appreciate any solution to the initial problem, whether it follows my logic or not.

undefined
  • Could you use `head` instead of `get`? You'd have to do a second call to get the body in the case where you want it. – cco Mar 23 '17 at 02:57
  • @cco I thought about that, but wouldn't it download the headers twice? I realise it wouldn't be much of a performance deficiency, but it doesn't exactly feel right. – undefined Mar 25 '17 at 17:58
  • Yes, but that's what the `HEAD` verb is for - getting the headers w/o the body. I guess it depends on the distribution of cases; if you're mostly reading the body, the extra cost on the server of generating a body you don't read might be OK. If the body is expensive to generate or you mostly don't need it, `HEAD` may be cheaper for the server. – cco Mar 25 '17 at 20:04
  • Sure, but my point is that I need to do both based on the link's condition: if it's fine, continue with the GET request; if it's not, abort the request. And it shouldn't be so much hassle, as the headers arrive well before the body is downloaded. – undefined Mar 25 '17 at 20:06
  • What does "fine" mean? If the response is 404 (or similar) there won't usually be a body anyway. – cco Mar 25 '17 at 20:08
  • Fine would be a non-error response, and most websites have a 404 page, which means it would actually download a body. – undefined Mar 25 '17 at 20:16
  • Yes, but most 404 pages are small. Trying to avoid reading the body of a 404 (or other error page) strikes me as premature optimization, and unneeded complexity. – cco Mar 25 '17 at 20:22
  • I would agree if I were checking 500 links, but I'm checking 50000 or even ten times that. In any case, I would like to race both versions of the software to see if the performance boost is negligible or if it actually makes a difference. – undefined Mar 25 '17 at 20:31
  • Error page bodies are usually small, and will usually be sent with the headers. If you don't want the body, just don't read it & close the connection. – cco Mar 25 '17 at 20:36
  • OK, so I am completely unfamiliar with any of these modules, but looking at the contents of `grequests.py` shows that grequests.send() should still accept `stream = True` as an argument; maybe that could help you out? – Montmons Mar 28 '17 at 14:12

2 Answers

I believe what you want is an HTTP HEAD request:

session.head('http://www.google.com')

Per w3.org, "the HEAD method is identical to GET except that the server MUST NOT return a message-body in the response." If you like the status code and headers, you can follow up with a normal GET request.
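
A rough sketch of that HEAD-then-GET flow with plain requests is shown below. The helper name `head_then_get` is invented for illustration, and the cutoff (anything with status 400 or above is skipped) is an assumption. Note that requests does not follow redirects on HEAD unless `allow_redirects=True` is passed.

```python
import requests


def head_then_get(url, timeout=10):
    """Return (status_code, body); body is None when HEAD reports an error."""
    # A Session lets both requests share a connection if the server keeps it open.
    with requests.Session() as session:
        # HEAD returns the status line and headers only, never a body.
        head = session.head(url, allow_redirects=True, timeout=timeout)
        if not head.ok:  # status code of 400 or above
            return head.status_code, None
        # The link looks fine, so issue the real GET and read the body.
        response = session.get(url, timeout=timeout)
        return response.status_code, response.content
```

The cost is one extra round trip for every good link, which is the trade-off discussed in the comments.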

From the comments, it looks like you might also be interested in doing this in a single request. That is possible directly with sockets: send the normal GET request, recv the first block, and if you don't like the result, close the connection; otherwise, loop over the remaining blocks.

Here is a proof of concept of how to download conditionally with a single request:

import socket

def fetch_on_header_condition(host, resource, condition, port=80):
    # Build a minimal HTTP/1.1 GET request; 'Connection: close' asks the
    # server to close the socket once the whole response has been sent.
    request = (
        'GET %s HTTP/1.1\r\n'
        'Host: %s\r\n'
        'Connection: close\r\n'
        '\r\n'
    ) % (resource, host)

    s = socket.socket()
    try:
        s.connect((host, port))
        # Sockets carry bytes, so encode the request before sending it.
        s.sendall(request.encode('ascii'))
        # The first block normally holds the status line and the headers.
        first_block = s.recv(4096)
        if not condition(first_block):
            return False, b''
        blocks = [first_block]
        # Keep reading until the server closes the connection.
        while True:
            block = s.recv(4096)
            if not block:
                break
            blocks.append(block)
        return True, b''.join(blocks)
    finally:
        s.close()

if __name__ == '__main__':
    print(fetch_on_header_condition(
        host='www.jython.org',
        port=80,
        resource='/',
        condition=lambda s: b'Content-Type: text/xml' in s,
    ))
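
The same single-request idea can also be sketched one level up, using the standard library's `http.client` with a thread pool so that many links are checked concurrently. This is only an illustration under assumptions: `check_and_fetch` and `fetch_all` are made-up names, the 400 cutoff and the pool size are arbitrary choices, and plain HTTP without redirect handling is assumed, as in the proof of concept above.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPConnection


def check_and_fetch(host, path='/', port=80, timeout=10):
    """Return (status, body); body is None when the status is 400 or above."""
    conn = HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request('GET', path)
        # getresponse() returns once the status line and headers are parsed;
        # the body is only read when read() is called.
        response = conn.getresponse()
        if response.status >= 400:
            return response.status, None  # skip the body entirely
        return response.status, response.read()
    finally:
        # Closing here abandons any body that was not read.
        conn.close()


def fetch_all(targets, max_workers=20):
    """Run check_and_fetch over (host, path, port) tuples in a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the targets.
        return list(pool.map(lambda target: check_and_fetch(*target), targets))
```

For example, `fetch_all([('www.jython.org', '/', 80)])` would return a one-element list with the status and, for a non-error response, the body bytes.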
Raymond Hettinger

Just check the status from the head request and proceed accordingly:

header = session.head('https://google.com')

# .ok is True for any status code below 400
if header.ok:
    response = session.get('https://google.com')
mcintosh