
I'm writing a script to DL the entire collection of BBC podcasts from various show hosts. My script uses BS4, Mechanize, and wget.

I would like to know how I can test if a request for a URL yields a response code of '404' from the server. I have written the function below:

import mechanize

def getResponseCode(br, url):
    print("Opening: " + url)
    try:
        response = br.open(url)
        print("Response code: " + str(response.code))
        return True
    except (mechanize.HTTPError, mechanize.URLError) as e:
        # HTTPError carries a status code (e.g. 404); URLError only has a reason
        if isinstance(e, mechanize.HTTPError):
            print("Mechanize error: " + str(e.code))
        else:
            print("Mechanize error: " + str(e.reason.args))
        return False

I pass into it my Browser() object and a URL string. It returns either True or False depending on whether the response is a '404' or a '200' (well actually, Mechanize throws an exception if it is anything other than a '200', hence the exception handling).

In main() I am basically looping over this function passing in a number of URLs from a list of URLs that I have scraped with BS4. When the function returns True I proceed to download the MP3 with wget.
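
Roughly, main() does something like this (a simplified sketch; the list name mp3_urls, the example URLs, and the way wget gets called are stand-ins for my actual code):

import subprocess
import mechanize

def main():
    br = mechanize.Browser()
    # mp3_urls stands in for the list of direct MP3 links scraped with BS4
    mp3_urls = ["http://example.com/show/episode1.mp3",
                "http://example.com/show/episode2.mp3"]
    for url in mp3_urls:
        if getResponseCode(br, url):
            # hand the URL off to wget for the actual download
            subprocess.call(["wget", url])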

However, my problem is:

  • The URLs are direct paths to the podcast MP3 files on the remote server, and I have noticed that when the URL is available, br.open(<URL>) will hang. I suspect this is because Mechanize is caching/downloading the actual data from the server. I do not want this, because I merely want to return True if the response code is '200'. How can I avoid the caching/DL and just test the response code?

I have tried using br.open_novisit(url, data=None), but the hang still persists...

uncle-junky
  • Why are you using Mechanize? Are you trying to simulate an actual browser? If so, why? Meanwhile, why are you trying to download a file just to see if wget can download it, instead of just downloading it? And why are you using wget? – abarnert Nov 29 '13 at 00:01
  • "Why are you using Mechanize?" I was under the assumption Mechanize provided a way to grab HTML data to process by BeautifulSoup4, i.e. response = br.open(url), soup = response.read(), before using soup.findAll() method to return all elements sought... "Are you trying to simulate an actual browser?" No. "Why are you trying to download a file just to see if wget can download it?" I'm not trying to DL a file, and then have wget DL it. Well, not least what I intended anyway. I'm a novice. If there's a better way of doing it I'll happily review your suggestion/code. – uncle-junky Nov 29 '13 at 00:34
  • Mechanize attempts to simulate a browser. It can grab HTML data, but there are much easier ways to do that, without using such a heavy-duty tool. Also, if you're not trying to DL a file and then have wget DL it, what are you trying to do? Do you need wget involved for some reason? If you already know how to get the data, why not just save the data into a file? – abarnert Nov 29 '13 at 00:36
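
To illustrate that last comment: grabbing HTML for BeautifulSoup doesn't need Mechanize at all; plain urllib works. A minimal sketch (the URL and the tags being collected here are placeholders):

import urllib.request
from bs4 import BeautifulSoup

# Fetch the page with the stdlib and hand the HTML straight to BS4
html = urllib.request.urlopen("http://www.bbc.co.uk/podcasts").read()
soup = BeautifulSoup(html, "html.parser")
# e.g. pull out every link on the page
links = [a.get("href") for a in soup.findAll("a")]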

1 Answer


I don't think there's any good way to get Mechanize to do what you want. The whole point of Mechanize is that it's trying to simulate a browser visiting a URL, and a browser visiting a URL downloads the page. If you don't want to do that, don't use an API designed for that.

On top of that, whatever API you're using, by sending a GET request for the URL, you're asking the server to send you the entire response. Why do that just to hang up on it as soon as possible? Use the HEAD request to ask the server whether it's available. (Sometimes servers won't HEAD things even when they should, so you'll have to fall back to GET. But cross that bridge if you come to it.)

For example:

import urllib.request

# HEAD asks only for the status line and headers, not the MP3 data itself
req = urllib.request.Request(url, method='HEAD')
resp = urllib.request.urlopen(req)
return 200 <= resp.code < 300
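
If a server refuses HEAD (with a 405 or 501, say), a fallback might look something like this (a sketch; which status codes trigger the retry is an assumption you may want to tune):

import urllib.request
import urllib.error

def url_is_available(url):
    # Try a cheap HEAD first; only fall back to GET if the server refuses HEAD
    for method in ('HEAD', 'GET'):
        req = urllib.request.Request(url, method=method)
        try:
            resp = urllib.request.urlopen(req)
            return 200 <= resp.status < 300
        except urllib.error.HTTPError as e:
            if method == 'HEAD' and e.code in (405, 501):
                continue  # method not allowed/implemented; retry with GET
            return False
        except urllib.error.URLError:
            return False
    return False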

But this raises a question:

When the function returns True I proceed to download the MP3 with wget.

Why? Why not just use wget in the first place? If the URL is gettable, it will get the URL; if not, it will give you an error—just as easily as Mechanize will. And that avoids hitting each URL twice.

For that matter, why try to script wget, instead of using the built-in support in the stdlib or a third-party module like requests?
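
For instance, with requests the whole download is a few lines (a sketch; deriving the local filename from the URL is just one possible choice):

import os
import requests

def download_mp3(url):
    # Stream the response to a local file instead of shelling out to wget
    filename = os.path.basename(url) or "episode.mp3"
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            f.write(chunk)
    return filename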


If you're just looking for a way to parallelize things, that's easy to do in Python:

from concurrent import futures
import urllib.request

def is_good_url(url):
    # HEAD request: ask for the status without pulling down the MP3 itself
    req = urllib.request.Request(url, method='HEAD')
    resp = urllib.request.urlopen(req)
    return url, 200 <= resp.code < 300

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    fs = [executor.submit(is_good_url, url) for url in urls]
    results = (f.result() for f in futures.as_completed(fs))
    good_urls = [url for (url, good) in results if good]

And to change this to actually download the valid URLs instead of just making a note of which ones are valid, just change the task function to something that fetches and saves the data from a GET instead of doing the HEAD thing. The ThreadPoolExecutor Example in the docs does almost exactly what you want.
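
Concretely, the task function from above could be swapped for something like this (a sketch reusing the urls list from the previous snippet; fetch_and_save, the filename handling, and printing each result are all just illustrative choices):

from concurrent import futures
import os
import shutil
import urllib.request

def fetch_and_save(url):
    # GET the MP3 and stream it straight into a local file
    filename = os.path.basename(url) or "episode.mp3"
    with urllib.request.urlopen(url) as resp, open(filename, "wb") as f:
        shutil.copyfileobj(resp, f)
    return url, filename

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    fs = [executor.submit(fetch_and_save, url) for url in urls]
    for f in futures.as_completed(fs):
        url, filename = f.result()
        print("Saved " + url + " as " + filename)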

abarnert