2

I have the following code to open and read URLs:

html_data = urllib2.urlopen(req).read()

and I believe this is the most standard way to read data over HTTP. However, when the response uses chunked transfer-encoding, the response starts with the following characters:

1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...

This is caused by the chunked encoding mentioned above, and it corrupts my XML data.

So how can I get rid of all the metadata related to the chunked encoding?

dragoon

3 Answers

1

I ended up with custom XML stripping, like this:

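    end_tag = '</mytag>'
    xml_start = html_data.find('<?xml')
    xml_end = html_data.rfind(end_tag)
    if xml_start != 0:
        # Anything before the XML declaration is chunking residue; log it.
        log_user_action(req.get_host(), 'chunked data', html_data, {})
    # Keep only the span from the XML declaration through the closing tag.
    if xml_start != -1 and xml_end != -1:
        html_data = html_data[xml_start:xml_end + len(end_tag)]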

I couldn't find a simpler solution.

dragoon
0

1eb0\r\n and 2625\r\n look like chunk-size lines: in chunked transfer encoding, each chunk of the payload is preceded by its length in hex, followed by CRLF.
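
To make the role of those hex values concrete, here is a minimal sketch of decoding a chunked body by hand. It assumes the raw bytes follow the HTTP/1.1 chunked format: a hex size line, CRLF, that many bytes of data, CRLF, repeated until a zero-size chunk. The function name `decode_chunked` and the sample `body` are hypothetical, chunk extensions after `;` are ignored, and trailer headers are discarded.

```python
def decode_chunked(raw):
    """Reassemble the payload from a chunked transfer-encoded body (bytes)."""
    payload = []
    pos = 0
    while True:
        # Each chunk starts with a size line: hex length, optional ";ext", CRLF.
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            # A zero-size chunk marks the end; trailer headers are ignored here.
            break
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        if raw[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk not terminated by CRLF")
        payload.append(chunk)
        pos += size + 2
    return b"".join(payload)


# Hypothetical sample: two chunks (4 and 5 bytes) plus the terminating chunk.
body = b"4\r\n<a>x\r\n5\r\n</a>\n\r\n0\r\n\r\n"
decoded = decode_chunked(body)
```

Because the size lines are used to slice the data, a missing or truncated chunk raises an error instead of silently producing malformed XML.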

FirefighterBlu3
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient [reputation](http://stackoverflow.com/faq#reputation) you will be able to [comment on any post](http://stackoverflow.com/privileges/comment). – Steven Rumbalski Nov 16 '12 at 16:38
  • @StevenRumbalski My comment is specific to the remark about removing the metadata. It's unwise to simply ignore those values and assume that what you get in the payload is the expected data. The metadata should be used to verify that the payload you have matches the chunk[s] you intend to operate over. In other words, don't just naively attempt to find the XML and ignore the chunking information, or you may end up with malformed blocks of data due to things being out of order or missing. Use the metadata to ensure correct chunk reassembly. – FirefighterBlu3 Feb 26 '13 at 06:47
-1

You can remove everything before <?xml:

html_data = html_data[html_data.find('<?xml'):]
Josh
  • Unfortunately I can't. Chunked encoding also adds some metadata after the payload. I've given the beginning of my dump just as an example. – dragoon Aug 28 '11 at 14:12
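
The point in the comment above can be illustrated with a small sketch. The sample payload and variable names are hypothetical. The body is built as a single chunk followed by the terminating zero-size chunk, and slicing from `<?xml` leaves the trailing chunk metadata in place.

```python
# Hypothetical payload wrapped in one chunk plus the terminating chunk.
payload = b"<?xml version='1.0'?><a/>"
size_line = ("%x" % len(payload)).encode("ascii")  # chunk length in hex
body = size_line + b"\r\n" + payload + b"\r\n0\r\n\r\n"

# Dropping everything before the XML declaration removes only the size line.
stripped = body[body.find(b"<?xml"):]

# The end-of-body marker is still attached after the XML.
leftover = stripped[len(payload):]
```

Here `leftover` holds the bytes that remain after the XML, which is why trimming only the front of the response is not enough.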