Python urllib2 decode chunked encoding

Question

I have the following code to open and read URLs:

html_data = urllib2.urlopen(req).read()

and I believe this is the most standard way to read data over HTTP. However, when the response uses chunked transfer encoding, the body starts with the following characters:

1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...

This is caused by the chunked encoding mentioned above, and it corrupts my XML data.

So how can I get rid of all the metadata related to the chunked encoding?

What happens when you try to load the source data in a web browser? Do you get the 1eb0 or 2625? And are those (and other) numbers consistent? — chaimp, Aug 29 '11 at 03:15

score 1 · Accepted Answer · answered Sep 04 '11 at 12:15

I ended up with custom xml stripping, like this:

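    xml_start = html_data.find('<?xml')
    if xml_start > 0:
        # Chunk metadata precedes the XML declaration: log it, then drop it.
        log_user_action(req.get_host(), 'chunked data', html_data, {})
        html_data = html_data[xml_start:]
    # Search for the closing tag only after trimming the front,
    # so the index refers to the current string.
    xml_end = html_data.rfind('</mytag>')
    if xml_end != -1:
        # Keep everything up to and including the closing tag.
        html_data = html_data[:xml_end + len('</mytag>')]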

Can't find any simple solution.

score 0 · Answer 2 · answered Sep 19 '12 at 23:44

1eb0\r\n2625\r\n is the segment start/stop positions (in hex) in the reassembled payload

FirefighterBlu3

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/faq#reputation) you will be able to [comment on any post](http://stackoverflow.com/privileges/comment). – Steven Rumbalski Nov 16 '12 at 16:38
@StevenRumbalski My comment is specific to the remark about removing the metadata. It's unwise to simply ignore those values and assume that what you get in the payload is the expected data. The metadata should be used to verify that the payload you have matches the chunk[s] you intend to operate on. In other words, don't just naively search for the XML and ignore the chunking information, or you may end up with malformed blocks of data because things are out of order or missing. Use the metadata to ensure correct chunk reassembly. – FirefighterBlu3 Feb 26 '13 at 06:47
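
To illustrate the point about using the chunk metadata rather than discarding it, here is a minimal sketch of a decoder. It assumes the raw body is available as a byte string and follows the standard HTTP/1.1 chunked layout, in which each chunk is preceded by its length in hexadecimal and a CRLF, and a zero-length chunk ends the body. The function name `decode_chunked` is hypothetical and not part of urllib2; trailers after the last chunk are ignored.

```python
def decode_chunked(raw):
    """Decode an HTTP/1.1 chunked body (bytes) and return the payload.

    Hypothetical helper for illustration; not part of urllib2.
    """
    payload = []
    pos = 0
    while True:
        # Each chunk starts with "<hex size>[;extensions]\r\n".
        line_end = raw.find(b'\r\n', pos)
        if line_end == -1:
            raise ValueError('missing chunk-size line')
        size_field = raw[pos:line_end].split(b';', 1)[0].strip()
        size = int(size_field.decode('ascii'), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailers are ignored in this sketch
        chunk = raw[pos:pos + size]
        # Verify the declared size against the data actually present.
        if len(chunk) != size or raw[pos + size:pos + size + 2] != b'\r\n':
            raise ValueError('truncated or malformed chunk')
        payload.append(chunk)
        pos += size + 2
    return b''.join(payload)
```

Because every chunk is checked against its declared size, a truncated or misaligned body raises `ValueError` instead of silently yielding damaged XML. If the leftover markers in the question come from something other than plain chunked framing, this sketch will reject the input rather than guess.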

score -1 · Answer 3 · answered Aug 28 '11 at 14:08

You can remove everything before <?xml:

html_data = html_data[html_data.find('<?xml'):]

Josh

Unfortunately I can't. Chunked encoding also adds some meta-data after payload. I've given the beginning of my dump just as an example. – dragoon Aug 28 '11 at 14:12
