Try using the Requests library. I haven't seen any rate limiting on my end: I was able to retrieve 13 titles in 21.6s. See below:
Code:
import requests as rq
from bs4 import BeautifulSoup as bsoup


def get_title(url):
    # Fetch the page and print the part of its <title> before the first " - ".
    r = rq.get(url)
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = bsoup(r.content, "html.parser")
    title = soup.find_all("title")[0].get_text()
    print(title.split(" - ")[0])


def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    # Fetch the pages one after another.
    for url in urls:
        get_title(url)


if __name__ == "__main__":
    main()
Output:
Tiger in a Tropical Storm (Surprised!)
The Green Dancer
Dandelions
The Little Owl
Farmhouse with Birch Trees
Boxer
Three Women
Flower
Ruby
Musical Instruments
The evening gown
Lizard
The Girl with a Pearl Earring
[Finished in 21.6s]
However, as a matter of personal ethics, I don't recommend scraping at full speed like this. With a fast connection, you'll hit the site harder than is polite. Letting the scraper sleep for a few seconds every 20 pages or so won't hurt.
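One way to add that pause, as a minimal sketch: wrap the URL list in a small generator that sleeps after every batch of items. The name `throttled` and its defaults (`batch_size=20`, `pause=3.0`) are my own choices for illustration, not anything the site requires, and the `sleep` parameter is only there so the pause can be swapped out when testing.

```python
import time


def throttled(items, batch_size=20, pause=3.0, sleep=time.sleep):
    """Yield items one by one, pausing after every `batch_size` items.

    `batch_size` and `pause` are illustrative defaults (assumptions),
    not limits published by any site.
    """
    for count, item in enumerate(items, start=1):
        yield item
        # The pause runs when the caller asks for the next item,
        # i.e. after the current item has been fully processed.
        if count % batch_size == 0:
            sleep(pause)
```

With this in place, the loop in `main()` would become `for url in throttled(urls): get_title(url)`. With only 13 URLs it never sleeps, but on a longer list it pauses for three seconds after every 20 pages.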
EDIT: An even faster version uses grequests, which allows the requests to be made asynchronously. It pulls the same data as above in 2.6s, roughly 8 times faster. Again, limit your scrape speed out of respect for the site.
import grequests as grq
from bs4 import BeautifulSoup as bsoup


def get_title(response):
    # Print the part of the page's <title> before the first " - ".
    soup = bsoup(response.content, "html.parser")
    title = soup.find_all("title")[0].get_text()
    print(title.split(" - ")[0])


def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    # Build the unsent requests lazily, then send them all concurrently.
    rs = (grq.get(u) for u in urls)
    for response in grq.map(rs):
        # grequests.map() yields None for a request that failed.
        if response is not None:
            get_title(response)


if __name__ == "__main__":
    main()
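To keep the asynchronous version polite as well, one option is to send the URLs in small batches and pause between them. The following is only a sketch under my own assumptions: `chunked`, `fetch_in_batches`, and `grequests_batch` are hypothetical helper names, and the batch size of 5 and pause of 2 seconds are arbitrary. The fetcher is passed in as a function so the batching logic can be tried without touching the network.

```python
import time


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(urls, fetch_batch, batch_size=5, pause=2.0, sleep=time.sleep):
    """Call `fetch_batch` on each batch of URLs, sleeping between batches.

    `batch_size` and `pause` are illustrative values (assumptions).
    """
    responses = []
    batches = list(chunked(urls, batch_size))
    for index, batch in enumerate(batches):
        responses.extend(fetch_batch(batch))
        # No need to sleep after the final batch.
        if index < len(batches) - 1:
            sleep(pause)
    return responses


def grequests_batch(batch):
    """Fetch one batch concurrently with grequests (imported lazily)."""
    import grequests as grq
    return grq.map(grq.get(u) for u in batch)
```

In `main()`, the `grq.map(rs)` loop could then iterate over `fetch_in_batches(urls, grequests_batch)` instead, so that at most five requests are in flight at a time. `grequests.map()` also accepts a `size` argument that caps concurrency, if a pause between batches is more than you need.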