Parsing responses of content-type chunked in python

Question

I'm trying to read and parse a response of content-type: chunked in Python. Here is what I see when I load the URL in a browser and view the source:

<!-- ---------------------------------------------------------------- http://github.com/Atmosphere ------------------------------------------------------------------------ --> 
<!-- Welcome to the Atmosphere Framework. To work with all the browsers when suspending connection, Atmosphere must output some data to makes WebKit based browser working.--> 
<!-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------- --> 
<!-- EOD -->[{"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam2","value":"2.505730663333334E9"},  {"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam1","value":"1.5584484E9"},{"__publicationName":"dip\/acc\/LHC\/Beam\/Energy","value":"495"},

I'd like to retrieve and parse the json entries like this one:

{"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam2","value":"2.505730663333334E9"}

How can I do this?

Thanks

score 1 · Answer 1 · answered Sep 15 '11 at 00:22

"chunked" is not a valid content-type, although it is a valid transfer-encoding. Judging from the sample you've posted, though, that isn't really your problem. This looks like a comment header prepended to a regular JSON response. A browser would usually ignore the SGML comments, but you'll have to strip them yourself before parsing. Here's one way of dealing with that:

>>> import json
>>> corpus = '''<!-- ---------------------------------------------------------------- http://github.com/Atmosphere ------------------------------------------------------------------------ --> 
... <!-- Welcome to the Atmosphere Framework. To work with all the browsers when suspending connection, Atmosphere must output some data to makes WebKit based browser working.--> 
... <!-- --------------------------------------------------------------------------------------------------------------------------------------------------------------------- --> 
... <!-- EOD -->[{"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam2","value":"2.505730663333334E9"},  {"__publicationName":"dip\/acc\/LHC\/Beam\/Intensity\/Beam1","value":"1.5584484E9"},{"__publicationName":"dip\/acc\/LHC\/Beam\/Energy","value":"495"}]'''
>>> junk, data = corpus.split('<!-- EOD -->', 1)
>>> parsed = json.loads(data)
>>> for item in parsed:
...     print item
... 
{u'__publicationName': u'dip/acc/LHC/Beam/Intensity/Beam2', u'value': u'2.505730663333334E9'}
{u'__publicationName': u'dip/acc/LHC/Beam/Intensity/Beam1', u'value': u'1.5584484E9'}
{u'__publicationName': u'dip/acc/LHC/Beam/Energy', u'value': u'495'}
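
For Python 3, the same split can be wrapped in a small helper. This is only a sketch: the name `extract_json_payload` is made up here, and it assumes the `<!-- EOD -->` marker always precedes the JSON, which the sample suggests but the API may not guarantee.

```python
import json

EOD_MARKER = "<!-- EOD -->"  # marker seen in the sample response


def extract_json_payload(body, marker=EOD_MARKER):
    """Return the parsed JSON that follows the marker in body.

    Hypothetical helper: raises ValueError if the marker is missing
    or if the text after it is not valid JSON.
    """
    _header, sep, data = body.partition(marker)
    if not sep:
        raise ValueError("marker %r not found in response" % marker)
    return json.loads(data)


if __name__ == "__main__":
    sample = ('<!-- padding -->\n<!-- EOD -->'
              '[{"__publicationName":"dip\\/acc\\/LHC\\/Beam\\/Energy","value":"495"}]')
    for item in extract_json_payload(sample):
        print(item["__publicationName"], item["value"])
```

Raising an error on a missing marker, rather than parsing the whole body, makes it obvious when the server changes its padding.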

and by 'an idea' of how it goes, he means a fully working solution.. well played, TokenMacGuy. — Profane, Sep 15 '11 at 01:42
Well, it may or may not work in general; it looks like the posted example is only a fragment of the response (the suffix is not valid JSON since it ends on a `,`). I'm also not sure whether there's any variability in what the API actually returns. It looks like it works based on the example, but I don't really *know* that it's a full, working solution. — SingleNegationElimination, Sep 15 '11 at 01:45
Hi, thanks for your help and solutions... very helpful. I have a problem though: if I do `foo = urllib2.urlopen('http://...')` and then `foo.read()`, it gets stuck because the connection stays open. If I do `foo.read(1000)`, the response is cut at an arbitrary point and I can end up with something like this: `{u'__publicationName': u'dip/ac`. How can I avoid this? Thanks! — Alex, Sep 15 '11 at 10:22
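
One possible way around the partial-read problem raised in the last comment is to keep a buffer and only parse objects once they have arrived in full. The sketch below assumes Python 3 and that everything after the `<!-- EOD -->` marker is a sequence of JSON objects separated by commas, whitespace, or array brackets, as in the sample; the class name `JsonObjectStream` is hypothetical. It relies on `json.JSONDecoder.raw_decode`, which parses one value from the start of a string and reports the index where that value ended.

```python
import json


class JsonObjectStream:
    """Collect text chunks and hand back each JSON object once it is complete."""

    def __init__(self, marker="<!-- EOD -->"):
        self._marker = marker
        self._seen_marker = False
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, chunk):
        """Add a text chunk; return the objects completed by this chunk."""
        self._buffer += chunk
        if not self._seen_marker:
            _header, sep, rest = self._buffer.partition(self._marker)
            if not sep:
                return []  # still inside the comment header
            self._seen_marker = True
            self._buffer = rest
        items = []
        while True:
            # Skip separators between objects: whitespace, commas, brackets.
            self._buffer = self._buffer.lstrip(" \t\r\n,[]")
            if not self._buffer:
                break
            try:
                obj, end = self._decoder.raw_decode(self._buffer)
            except ValueError:
                break  # the object is incomplete; wait for more data
            items.append(obj)
            self._buffer = self._buffer[end:]
        return items


if __name__ == "__main__":
    stream = JsonObjectStream()
    # Simulated reads that cut the data at arbitrary points.
    for piece in ('<!-- EO', 'D -->[{"value":"495"}, {"val', 'ue":"1.5E9"},'):
        for item in stream.feed(piece):
            print(item)
```

In Python 3 the chunks returned by `read(1000)` are bytes, so they would need decoding to text before being passed to `feed`. If the server ever sends something other than objects after the marker, this sketch would stall on it, so treat it as a starting point only.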
