
I'm trying to programmatically retrieve editing history pages from the MusicBrainz website. (musicbrainzngs is a library for the MB web service, but editing history is not accessible from the web service.) For this, I need to log in to the MB website using my username and password.

I've tried using the mechanize module: using the login page's second form (the first one is the search form), I submit my username and password; judging from the response, I successfully log in to the site. However, a further request to an editing history page raises an exception:

mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

I understand the exception and the reason for it. I take full responsibility for not abusing the site (after all, any usage will be tagged with my username); I just want to avoid manually opening a page, saving the HTML and running a script on the saved HTML. Can I overcome the 403 error?
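For reference, here is roughly what the script does (a minimal sketch; the login URL, form index and control names are my guesses from the page, and the edit-history URL is just a placeholder):

import mechanize

br = mechanize.Browser()

# Open the login page and pick the second form (the first is the search box)
br.open("https://musicbrainz.org/login")
br.select_form(nr=1)

# Control names are guesses from the page source
br["username"] = "my_username"
br["password"] = "my_password"
br.submit()

# The response suggests the login succeeded, but this next request
# raises the 403 "request disallowed by robots.txt" error
history = br.open("https://musicbrainz.org/artist/<MBID>/edits")
print(history.read())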

tzot

2 Answers


The better solution is to respect the robots.txt file and simply download the edit data itself rather than screen-scraping MusicBrainz. You can download the complete edit history here:

ftp://ftp.musicbrainz.org/pub/musicbrainz/data/fullexport

Look for the file mbdump-edit.tar.bz2.
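If it helps, here is a rough sketch of fetching and opening that dump with Python's standard library (the dated sub-directory layout and the anonymous FTP login are assumptions on my part):

import ftplib
import tarfile

ftp = ftplib.FTP("ftp.musicbrainz.org")
ftp.login()  # anonymous login
ftp.cwd("pub/musicbrainz/data/fullexport")

# Assumption: each export sits in a dated sub-directory whose names sort
# chronologically; pick the most recent one.
exports = sorted(d for d in ftp.nlst() if d[:1].isdigit())
ftp.cwd(exports[-1])

with open("mbdump-edit.tar.bz2", "wb") as out:
    ftp.retrbinary("RETR mbdump-edit.tar.bz2", out.write)
ftp.quit()

# The dump is a bzip2-compressed tar archive; list the first few members
with tarfile.open("mbdump-edit.tar.bz2", "r:bz2") as tar:
    for member in tar.getmembers()[:10]:
        print(member.name)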

And, as the leader of the MusicBrainz team, I would like to ask you to respect robots.txt and download the edit data. That's one of the reasons why we make the edit data downloadable.

Thanks!

Mayhem
  • The load will be about **90** requests (only the first page of some recording edits) in a **month**, checking for pointers that some PUID association or disassociation might be mistaken; recent enough data are almost a requirement for my purposes. I just want to automate what I would in any case do manually. But if you prefer I download 1.5 GiB of data every time more recent data become available, I'll do as you suggest. Thanks for the pointers. – tzot Mar 10 '12 at 23:51

If you want to circumvent the site's robots.txt, you can tell your mechanize.Browser to ignore it:

br = mechanize.Browser()
br.set_handle_robots(False)

Additionally, you might want to alter your browser's user agent so you don't look like a robot:

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

Please be aware that when doing this, you're actually tricking the website into thinking you're a valid client.
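Putting the two settings together before making the request, something like this (the edit-history URL is only a placeholder):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # don't fetch or obey robots.txt
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '
                  'Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Log in as you already do, then the edit-history request goes through
response = br.open("https://musicbrainz.org/artist/<MBID>/edits")
print(response.read())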

Thomas Orozco