
I'm trying to get a long JSON response (~75 MB) from a web service, but I can only receive the first 25 MB or so.

I've used urllib2 and python-requests, but neither works. I've also tried reading the data in separate parts and streaming it, but that doesn't work either.

An example of the data can be found here:

http://waterservices.usgs.gov/nwis/iv/?site=14377100&format=json&parameterCd=00060&period=P260W

My code is as follows:

import requests

r = requests.get("http://waterservices.usgs.gov/nwis/iv/?site=14377100&format=json&parameterCd=00060&period=P260W")

usgs_data = r.json() # script breaks here

# Save Longitude and Latitude of river
latitude = usgs_data["value"]["timeSeries"][0]["sourceInfo"]["geoLocation"]["geogLocation"]["latitude"]
longitude = usgs_data["value"]["timeSeries"][0]["sourceInfo"]["geoLocation"]["geogLocation"]["longitude"]

# dictionary of all past river flows in cubic feet per second
river_history = usgs_data['value']['timeSeries'][0]['values'][0]['value']

It breaks when the script tries to decode the JSON (i.e. at usgs_data = r.json()) with:

ValueError: Expecting object: line 1 column 13466329 (char 13466328)

This is because the full data hasn't been received, so the response isn't a valid JSON document.
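One way to confirm the truncation (a rough check; Content-Length may be absent for chunked transfers, and it counts compressed bytes when gzip is used) is to compare the bytes actually received with the length the server advertised:

import requests

r = requests.get("http://waterservices.usgs.gov/nwis/iv/?site=14377100&format=json&parameterCd=00060&period=P260W")

# r.content is the decompressed body; Content-Length (if present) is the size
# the server claimed it would send, so a large mismatch suggests the
# connection was cut short.
print("received bytes:", len(r.content))
print("advertised Content-Length:", r.headers.get("Content-Length"))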

Ben
  • Interesting, it works for me, `r.json()` is not throwing errors.. – alecxe Dec 27 '15 at 05:42
  • @alecxe It does seem to work occasionally for me, and other times it errors out. I guess this supports the claim that it is an issue with their server. – Ben Dec 27 '15 at 13:28

1 Answer


The problem seems to be that the server won't serve more than 13MB of data at a time.

I have tried that URL using a number of HTTP clients including curl and wget, and all of them bomb out at about 13MB. I have also tried enabling gzip compression (as should you), but the results were still truncated at 13MB after decompression.

You are requesting too much data because the period=P260W specifies 260 weeks. If you try setting period=P52W you should find that you are able to retrieve a valid JSON response.

To reduce the amount of data transferred, set the Accept-Encoding header like this:

import requests

url = 'http://waterservices.usgs.gov/nwis/iv/'
params = {'site': 11527000, 'format': 'json', 'parameterCd': '00060', 'period': 'P52W'}
r = requests.get(url, params=params, headers={'Accept-Encoding': 'gzip,deflate'})
mhawke
    Actually, `requests` sets the `Accept-Encoding: gzip, deflate` header by default, so it should be unnecessary for you to do that. – mhawke Dec 27 '15 at 06:20
  • Unfortunately I need 260 weeks of data for this project, so I'm a little stuck there. Is there anything I can do on my end to allow more information to be pushed by the server? It does seem to work occasionally. – Ben Dec 27 '15 at 13:34
  • @Ben: as a workaround I was going to suggest that you use `startDT` and `endDT` parameters to make multiple requests for shorter intervals and then merge the results; however, testing that, even intervals of just 1 day sometimes result in truncated responses. Possibly the requests are being handled by different servers. Try out this idea and if you have trouble with it I think that you might need to take this up with the web service provider. – mhawke Dec 28 '15 at 10:05
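For reference, a rough sketch of the startDT/endDT workaround mhawke describes (assuming the service accepts ISO-8601 dates for those parameters; the chunk size is an arbitrary choice, the merge step follows the JSON structure shown in the question, and each chunk can still come back truncated, in which case r.json() raises and the chunk would need retrying):

import requests
from datetime import date, timedelta

url = 'http://waterservices.usgs.gov/nwis/iv/'

def fetch_history(site, weeks=260, chunk_weeks=26):
    # Fetch the flow history in shorter date ranges and merge the results.
    end = date.today()
    start = end - timedelta(weeks=weeks)
    merged = []
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + timedelta(weeks=chunk_weeks), end)
        params = {'site': site, 'format': 'json', 'parameterCd': '00060',
                  'startDT': chunk_start.isoformat(), 'endDT': chunk_end.isoformat()}
        r = requests.get(url, params=params)
        r.raise_for_status()
        data = r.json()  # raises ValueError if this chunk was truncated too
        series = data['value']['timeSeries']
        if series:  # a chunk with no readings returns an empty timeSeries list
            merged.extend(series[0]['values'][0]['value'])
        chunk_start = chunk_end + timedelta(days=1)
    return merged

river_history = fetch_history(14377100)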