
I have a script that takes a URL and returns the value of the page's <title> tag. After a few hundred or so runs, I always get the same error:

File "/home/edmundspenser/Dropbox/projects/myfiles/titlegrab.py", line 202, in get_title
    status, response = http.request(pageurl)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1390, in _request
    raise RedirectLimit("Redirected more times than rediection_limit allows.", response, content)
httplib2.RedirectLimit: Redirected more times than rediection_limit allows.

My function looks like:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer  # with BeautifulSoup 3: from BeautifulSoup import BeautifulSoup, SoupStrainer

def get_title(pageurl):
    http = httplib2.Http()
    status, response = http.request(pageurl)
    # parse only the <title> element, then strip the surrounding <title></title> tags
    x = BeautifulSoup(response, parseOnlyThese=SoupStrainer('title'))
    x = str(x)
    y = x[7:-8]
    # keep only the text before the first '-' (drops the artist/site suffix)
    z = y.split('-')[0]
    return z

Pretty straightforward. I used try/except with time.sleep(1) to give it time to get unstuck, in case that was the issue, but so far nothing has worked. And I don't want to just skip the URL. Maybe the website is rate-limiting me?
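Roughly, the retry I tried looked like this (the wrapper name and retry count here are just illustrative):

import time
import httplib2

def get_title_with_retries(pageurl, retries=3, wait=1):
    # retry a few times, sleeping between attempts
    for attempt in range(retries):
        try:
            return get_title(pageurl)
        except httplib2.RedirectLimit:
            if attempt == retries - 1:
                raise  # still failing after every retry
            time.sleep(wait)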

Edit: As of right now the script doesn't work at all; it runs into the error above on the very first request.

I have a JSON file of over 80,000 URLs for www.wikiart.org painting pages. For each one I run my function to get the title. So:

print repr(get_title('http://www.wikiart.org/en/vincent-van-gogh/van-gogh-s-chair-1889'))

returns

"Van Gogh's Chair"
  • Use your code on another website to see if it works. If so, then the site is rate-limiting you. – KooKoo Nov 20 '14 at 18:08
  • Let's say they are rate-limiting me. How do find out the rate so that I can slow down but still go as fast as possible? Is there a way I can get more out of one http request rather than sending a new one for each url? – ian-campbell Nov 20 '14 at 18:11
  • Can you specify what URL you are trying to access? – WGS Nov 20 '14 at 18:20

1 Answer


Try using the Requests library. On my end, there doesn't seem to be any rate-limiting; I was able to retrieve 13 titles in 21.6s. See below:

Code:

import requests as rq
from bs4 import BeautifulSoup as bsoup

def get_title(url):

    r = rq.get(url)
    soup = bsoup(r.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():

    urls = [
    "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
    "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
    "http://www.wikiart.org/en/claude-monet/dandelions",
    "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
    "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
    "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
    "http://www.wikiart.org/en/fernand-leger/three-women-1921",
    "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
    "http://www.wikiart.org/en/alphonse-mucha/ruby",
    "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
    "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
    "http://www.wikiart.org/en/m-c-escher/lizard-1",
    "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]

    for url in urls:
        get_title(url)

if __name__ == "__main__":
    main()

Output:

Tiger in a Tropical Storm (Surprised!) 
The Green Dancer
Dandelions
The Little Owl
Farmhouse with Birch Trees
Boxer
Three Women
Flower
Ruby
Musical Instruments
The evening gown
Lizard
The Girl with a Pearl Earring
[Finished in 21.6s]

However, out of personal ethics, I don't recommend scraping like this at full speed. With a fast connection you'll pull data too quickly; letting the scraper sleep for a few seconds every 20 pages or so won't hurt.
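Something along these lines would do (the pause length and page interval here are arbitrary):

import time

for i, url in enumerate(urls, start=1):
    get_title(url)
    # pause for a few seconds every 20 pages so the site isn't hammered
    if i % 20 == 0:
        time.sleep(5)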

EDIT: An even faster version uses grequests, which allows asynchronous requests to be made. This pulls the same data as above in 2.6s, nearly 10 times faster. Again, limit your scrape speed out of respect for the site.

import grequests as grq
from bs4 import BeautifulSoup as bsoup

def get_title(response):

    soup = bsoup(response.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():

    urls = [
    "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
    "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
    "http://www.wikiart.org/en/claude-monet/dandelions",
    "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
    "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
    "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
    "http://www.wikiart.org/en/fernand-leger/three-women-1921",
    "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
    "http://www.wikiart.org/en/alphonse-mucha/ruby",
    "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
    "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
    "http://www.wikiart.org/en/m-c-escher/lizard-1",
    "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]

    rs = (grq.get(u) for u in urls)
    for i in grq.map(rs):
        get_title(i)

if __name__ == "__main__":
    main()
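If you go with grequests, you can still throttle by mapping the URLs in small batches, roughly like this (the batch size and pause are again arbitrary):

import time
import grequests as grq

def fetch_in_batches(urls, batch_size=10, pause=5):
    # fire off a small batch of asynchronous requests, then pause before the next one
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        rs = (grq.get(u) for u in batch)
        for response in grq.map(rs):
            if response is not None:  # grequests gives None for failed requests
                get_title(response)
        time.sleep(pause)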
  • I like the requests method and I am using it, but I still get a similar error: `requests.exceptions.TooManyRedirects: Exceeded 30 redirects.` My code completed almost 50,000 title-url pairs in a few short hours before this error started happening, so I kinda feel bad and understand if I've been limited for swamping them. – ian-campbell Nov 20 '14 at 18:50
  • Then this is possibly a temporary ban of sorts. Almost every big site out there has some sort of defense in place to prevent being scraped. Won't be surprised if Wikimedia had one for their sites. – WGS Nov 20 '14 at 18:52
  • I'll put the code on hold till tonight or tomorrow then. I'm looking forward to using `grequests` since I was looking for an upgrade to a one-at-a-time for loop. – ian-campbell Nov 20 '14 at 18:54
  • Let me know what happens then. I'm mighty curious as to how long this ban will last, if ever. Also, Wikimedia has an API but I don't rightly know if it applies to WikiArt. The API allows only 5000 queries though, and you need researcher permission for that, whatever that means. – WGS Nov 20 '14 at 18:57
  • I just realized after looking over the site: it's not part of WikiMedia. Woooh, mighty mistake on that one for me. – WGS Nov 20 '14 at 19:00
  • Exactly, wikiart is a separate thing so we'll see what happens. I don't know the rules of HTTP, but is there a way to bundle 5-10 URLs into a single HTTP request? Or does each URL by nature require a separate request? – ian-campbell Nov 20 '14 at 19:05
  • Metatron, I meant to ask, what is this snippet for: `if __name__ == "__main__": main()` – ian-campbell Nov 20 '14 at 22:23
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/65316/discussion-between-metatron-and-edmund-spenser). – WGS Nov 20 '14 at 22:51