Python sock.recv not getting all data from page

Question

Learning low-level socket communication has been a very hard step for me, but I really want to learn it. I've hit a wall and can't seem to find the proper way.

How can I get all the data? I've tried multiple things, but I'm only able to get part of the response.

The URL I'm trying right now is:

http://steamcommunity.com/market/search/render/?query=&start=0&count=100&search_descriptions=0&sort_column=price&sort_dir=asc&appid=730&category_730_ItemSet%5B%5D=any&category_730_ProPlayer%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Rarity%5B%5D=tag_Rarity_Ancient_Weapon

After some research I tried the approach below, but I still wasn't able to print the full JSON page above. Is there anything I'm doing wrong?

        sock.send(request)
        response = ""
        first = True
        length = 0
        while True:
            partialResponse = sock.recv(65536)
            if len(partialResponse) != 0:
                #print("all %s" % partialResponse)
                # Extract content length from the first chunk
                if first:
                    startPosition = partialResponse.find("Content-Length")
                    if startPosition != -1:
                        endPosition = partialResponse.find("\r\n", startPosition+1)
                        length = int(partialResponse[startPosition:endPosition].split(" ")[1])
                    first = False
                # add current chunk to entire content
                response += partialResponse
                # remove chunksize from chunck
                startPosition = response.find("\n0000")
                if startPosition != -1:
                    endPosition = response.find("\n", startPosition+1)
                    response = response[0:startPosition-1] + response[endPosition+1:]
                if len(response[response.find("\r\n\r\n")+2:]) > length:
                    break
            else:
                break
        print response

If you want the whole thing, why are you asking for a max of 64K bytes? — Patrick Maupin, Sep 06 '15 at 21:41
@ryachza Well, I've tried printing the response to a log, and each time the response is cut short compared to what shows in the Chrome browser. — Marie Anne, Sep 06 '15 at 22:03

ryachza · Answer 1 · 2015-09-06T23:04:16.080

I was able to duplicate the issue, and it seemed that the server was not returning a Content-Length header. That left length at 0, so the check if len(response[..]) > length triggered as soon as any body data had arrived. Changing that statement to if length > 0 and ... seemed to resolve it.

I also had to increase the timeout I had set from 0.3 to 0.5 seconds in order to get responses consistently.

I was receiving a content-length in Chrome, but probably because the content-encoding was gzip. I guess they don't send a content-length for uncompressed responses.

The Content-Length section of this document lists the header as a "SHOULD".
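As a rough sketch of that fix, assuming Python 3 and headers held as bytes (the question's code is Python 2), the length can be kept as None when the header is missing, so "no length given" is never confused with a length of 0. The names parse_content_length and body_complete are made up for illustration:

```python
def parse_content_length(header_block):
    """Return Content-Length as an int, or None if it is absent or malformed."""
    # Skip the status line, then look at each "Name: value" header line.
    for line in header_block.split(b"\r\n")[1:]:
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-length":
            try:
                return int(value.strip())
            except ValueError:
                return None
    return None


def body_complete(body, length):
    """Only trust the length check when the server actually sent a length."""
    return length is not None and len(body) >= length
```

When parse_content_length returns None, the loop has to rely on something else to know the response is over, such as recv returning an empty result once the server closes the connection.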

Other general recommendation: I would not assume that the first chunk will always include all headers. There really should be no switch on "first". You should probably read until you encounter the \r\n\r\n that signals the header completion and process that separately with everything following being the response body.
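A minimal sketch of that idea, assuming Python 3 and a connected socket on which the request has already been sent. The function name read_headers and the bufsize parameter are hypothetical, not part of the original code:

```python
import socket

HEADER_END = b"\r\n\r\n"


def read_headers(sock, bufsize=4096):
    """Read until the blank line that ends the headers.

    Returns (headers, body_start), where body_start holds any body bytes
    that happened to arrive in the same recv() as the end of the headers.
    """
    buf = b""
    while HEADER_END not in buf:
        chunk = sock.recv(bufsize)
        if not chunk:
            raise ConnectionError("connection closed before headers were complete")
        buf += chunk
    headers, _, body_start = buf.partition(HEADER_END)
    return headers, body_start
```

Because the search runs over the accumulated buffer rather than over each chunk, it does not matter how many recv calls the headers are spread across.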

Edit per comment:

For something quick and dirty, I would probably just do this:

response = ''
while True:
    chunk = sock.recv(65536)
    if not chunk:
        # An empty result means the server closed the connection
        break
    response += chunk

# Everything before the first blank line is the header block
headers, _, body = response.partition('\r\n\r\n')

print response
print body
print headers

print len(response), len(body), len(headers)

Just rip everything the socket receives into a string and don't try to interpret it at all. This will give you the best chance of getting everything.

I definitely think playing around at this level is a great way to learn and completely worth every moment of the time. That being said, there is a reason libraries are commonly preferred for this sort of thing.

There really isn't a whole lot guaranteed by HTTP - it's very flexible and there are a lot of variables. So you will need to start with essentially no expectations or requirements and build up carefully, constantly thinking "what if this, what if that". One thing to be careful of is that the chunking can happen anywhere. A chunk could break before the headers are complete. It could even break in between a \r and a \n, meaning you would need to parse across chunks to detect boundaries. For common usage, reading the whole response into memory probably isn't a problem, but certain responses or other requirements can make that impractical or impossible.
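To illustrate both points, here is a hedged sketch (Python 3) of a small helper that can be fed chunks of any size. It only buffers the header bytes, finds the separator even when it is split across chunks, and hands body bytes to a callback instead of keeping the whole response in memory. The class name ResponseSplitter and the on_body callback are hypothetical names for this example:

```python
class ResponseSplitter:
    """Feed it recv() results in any sizes; it separates headers from body."""

    SEP = b"\r\n\r\n"

    def __init__(self, on_body):
        self._pending = b""      # header bytes collected so far
        self.headers = None      # set once the separator has been seen
        self._on_body = on_body  # called with each piece of body data

    def feed(self, chunk):
        if self.headers is not None:
            # Headers are done: pass body data straight through.
            if chunk:
                self._on_body(chunk)
            return
        # Search the accumulated bytes, so a separator that straddles
        # two chunks (for example "\r" then "\n\r\n") is still found.
        self._pending += chunk
        head, sep, rest = self._pending.partition(self.SEP)
        if sep:
            self.headers = head
            self._pending = b""
            if rest:
                self._on_body(rest)
```

In a receive loop this would be used as splitter.feed(sock.recv(65536)) until recv returns an empty result, with on_body writing to a file or appending to a list, whichever suits the response size.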

Is there an easier or more efficient way of receiving the whole page without all the conditions I wrote above? It feels like too much work. — Marie Anne, Sep 06 '15 at 22:38
