Ensure a page has downloaded correctly in Python

Question

I am writing a basic screen-scraping script using Mechanize and BeautifulSoup (BS) in Python. The problem is that, for some reason, the requested page does not download correctly every time. I conclude this because searching a downloaded page with BS for tags that should be present sometimes raises an error. If I download the page again, it works.

Hence, I would like to write a small function that checks whether the page has downloaded correctly and re-downloads it if necessary (I could also solve it by figuring out what goes wrong, but that is probably too advanced for me). How would I go about checking that the page has been downloaded correctly?

Show us the offending code. Otherwise, any suggestion would be too general to be useful to you. — Don Question, Jan 31 '12 at 13:57
"solve it by figuring out what goes wrong". Good idea. Dump the output from mechanize to see what went wrong. Perhaps that's a better question to ask. — S.Lott, Jan 31 '12 at 15:16
I used Denis's suggestion to create a small function that checks every downloaded page, but that did not work, so I dumped the output as per S.Lott's suggestion and, lo and behold, it's a problem with BeautifulSoup. For some reason BS randomly fails to find the tags, even though they are in the document. Recreating the BS object doesn't work either; I have to re-download the page and then recreate the object, and then it works. I'll do some more testing and come back with another question. Thanks guys. — Neal Sidd, Feb 01 '12 at 14:42

score 0 · Accepted Answer · answered Jan 31 '12 at 14:48

You can just check for a tag you expect to be there, and if it fails, repeat the download.

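from bs4 import BeautifulSoup  # BeautifulSoup 3: from BeautifulSoup import BeautifulSoup

# `page` holds the downloaded HTML
soup = BeautifulSoup(page)

while soup.body is None:
    # re-download the page into `page` here, then parse it again
    soup = BeautifulSoup(page)
# now you can use the data in `soup`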
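
The loop above retries forever if the page never parses. A minimal sketch of a bounded retry, assuming you supply your own download and validity-check callables, might look like the following. The names `fetch_until_valid`, `fetch`, `is_valid`, `max_attempts` and `delay` are hypothetical and do not come from Mechanize or BeautifulSoup.

```python
import time


def fetch_until_valid(fetch, is_valid, max_attempts=3, delay=0.0):
    """Call fetch() until is_valid(result) is true or attempts run out.

    fetch and is_valid are caller-supplied callables (hypothetical names):
    fetch() returns the page text, is_valid(text) returns True or False.
    """
    for attempt in range(1, max_attempts + 1):
        html = fetch()
        if is_valid(html):
            return html
        if attempt < max_attempts:
            time.sleep(delay)  # optional pause before the next download
    raise RuntimeError("page still invalid after %d attempts" % max_attempts)


# Usage with a fake downloader that fails once, then succeeds.
responses = iter(["<html><head>", "<html><body>ok</body></html>"])
page = fetch_until_valid(lambda: next(responses), lambda text: "<body" in text)
```

In a real script, `fetch` would wrap the Mechanize download and `is_valid` could parse the text and test for the expected tag.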

Rik Smith-Unna


score 0 · Answer 2 · answered Jan 31 '12 at 14:54

I think you can simply search for the closing html tag. If that tag is present, the page is valid.

Denis


score 0 · Answer 3 · answered Jan 31 '12 at 14:54

The most generic solution is to check that the </html> closing tag exists. That will allow you to detect truncation of the page.
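
As a rough sketch of that check, assuming the downloaded page is already available as a text string, something like the following could work. The function name `looks_complete` is made up for illustration, and the check only detects truncation, not other kinds of bad download.

```python
import re

# Matches a closing html tag in any letter case, allowing whitespace before ">".
_CLOSING_HTML = re.compile(r"</html\s*>", re.IGNORECASE)


def looks_complete(html):
    """Return True if the text contains a closing </html> tag."""
    return _CLOSING_HTML.search(html) is not None


# Usage: a complete page passes, a truncated one does not.
complete = looks_complete("<html><body><p>hi</p></body></HTML>\n")
truncated = looks_complete("<html><body><p>hi")
```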

For anything else, you will have to describe your failure mode more clearly.

Marcin

