I am writing a basic screen-scraping script in Python using Mechanize and BeautifulSoup (BS). The problem is that, for some reason, the requested page does not always download correctly. I conclude this because searching a downloaded page with BS for tags that should be present sometimes raises an error. If I download the page again, it works.

Hence, I would like to write a small function that checks whether the page has downloaded correctly and re-downloads it if necessary (I could also solve it by figuring out what goes wrong, but that is probably too advanced for me). How would I go about checking whether the page has been downloaded correctly?

Neal Sidd
  • Show us the offending code. Otherwise any suggestion would be too general to be useful to you. – Don Question Jan 31 '12 at 13:57
  • "solve it by figuring out what goes wrong". Good idea. Dump the output from mechanize to see what went wrong. Perhaps that's a better question to ask. – S.Lott Jan 31 '12 at 15:16
  • I used Denis's suggestion to create a small function that checks every downloaded page, but that did not work. So I dumped the output as per S.Lott's suggestion and, lo and behold, it's a problem with BeautifulSoup: for some reason BS randomly fails to find the tags, even though they are in the document. Recreating the BS object doesn't work either; I have to re-download the page and then recreate the object, and then it works. I'll do some more testing and come back with another question. Thanks guys. – Neal Sidd Feb 01 '12 at 14:42

3 Answers

You can just check for a tag you expect to be there, and if it fails, repeat the download.

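import mechanize
from bs4 import BeautifulSoup

browser = mechanize.Browser()
# url is assumed to hold the address of the page being scraped
soup = BeautifulSoup(browser.open(url).read(), "html.parser")

while soup.body is None:
    # Re-download the page and parse it again
    soup = BeautifulSoup(browser.open(url).read(), "html.parser")
# Now you can use the data in soup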
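
A loop like this never ends if the page keeps failing, so it may be worth capping the number of attempts. Here is a minimal sketch of that idea. The names fetch_until_valid, fetch, is_valid, max_attempts and delay are all hypothetical and are not part of Mechanize or BeautifulSoup:

```python
import time


def fetch_until_valid(fetch, is_valid, max_attempts=3, delay=0.0):
    """Call fetch() until is_valid(result) is true or attempts run out.

    fetch and is_valid are hypothetical callables supplied by the caller.
    """
    for attempt in range(1, max_attempts + 1):
        html = fetch()
        if is_valid(html):
            return html
        if attempt < max_attempts:
            time.sleep(delay)  # optional pause before the next attempt
    raise RuntimeError("page still invalid after %d attempts" % max_attempts)
```

With the setup above, fetch could be something like lambda: browser.open(url).read(), and is_valid could parse the text and check that soup.body is not None. Both are assumptions about your script rather than anything the libraries require.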
Rik Smith-Unna

I think you can simply search for the closing </html> tag. If it is present, the page downloaded completely.

Denis

The most generic solution is to check that the </html> closing tag exists. That will allow you to detect truncation of the page.

Anything else, and you will have to describe your failure mode more clearly.
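
As a rough sketch, the closing-tag check could look like the following. The function name has_closing_html_tag is made up for illustration, and the UTF-8 decoding of byte input is an assumption about the page's encoding:

```python
import re

# Matches </html>, allowing whitespace before the closing bracket.
_CLOSING_HTML = re.compile(r"</html\s*>", re.IGNORECASE)


def has_closing_html_tag(html):
    """Return True if the downloaded text contains a closing </html> tag."""
    if isinstance(html, bytes):
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        html = html.decode("utf-8", errors="replace")
    return _CLOSING_HTML.search(html) is not None
```

Note that this only catches truncated downloads. The HTML specification also allows the closing tag to be omitted, so a page that fails this check is not necessarily broken.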

Marcin