
I'm trying to download images from this site: http://www.domu.com/chicago/neighborhoods/humboldt-park/1641-n-maplewood-ave-apt-1-chicago-il-60647

The target site recently changed how it delivers images, using unique URLs. I'm getting a 403 error when I download the images. Link below. I can load each image once in a browser, but subsequent requests produce a 403 error. When I switch the browser to private mode, I can reload the image multiple times. This led me to believe they are tracking cookies in some way. I tried disabling cookies in Scrapy but continue to get a 403 error. I also tried enabling cookies while processing one request at a time; that also produces a 403 error. The target site is using a Varnish server for caching, and I assume Varnish includes some anti-scraping technology.

http://www.domu.com/sites/default/files/styles/gallery/public/filefield/field_img/20141117_133559.jpg?itok=pDSP-06i
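For reference, "disable cookies" and "process one request at a time" in Scrapy come down to settings along these lines (a sketch of the relevant settings.py entries, not a full configuration):

# settings.py (sketch)
COOKIES_ENABLED = False    # turn off Scrapy's cookie middleware
CONCURRENT_REQUESTS = 1    # handle a single request at a time
DOWNLOAD_DELAY = 1         # optional pause between requests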

Any thoughts on how to download images?

dfriestedt

3 Answers

1

Here is a possible solution using Selenium WebDriver and the wget command.

With WebDriver you emulate the browser navigation, extract the unique URL, and download it with the wget command.

import os
import time

from scrapy.spiders import CrawlSpider
from selenium import webdriver


class DomuSpider(CrawlSpider):
    name = "domu_spider"
    allowed_domains = ['domu.com']
    start_urls = ['http://www.domu.com/chicago/neighborhoods/humboldt-park/1641-n-maplewood-ave-apt-1-chicago-il-60647']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        # let the browser generate the unique image URLs, then fetch each one with wget
        for element in self.driver.find_elements_by_css_selector("img"):
            src = element.get_attribute('src')
            print(src)
            time.sleep(1)
            # quote the URL so the ?itok=... query string survives the shell
            os.system('wget "' + src + '"')
        self.driver.quit()

Documentation at http://selenium-python.readthedocs.org

aberna
  • definitely a reasonable solution. I was hoping to avoid Selenium and figure out how to solve this with Scrapy. I've tried rotating the IP through a proxy and changing the user agent; neither solves the issue – dfriestedt Nov 27 '14 at 17:06
  • I wonder if I can write the files directly from the response in Scrapy. – dfriestedt Nov 27 '14 at 17:07
  • I get your point, but when I read there was a 403 issue, the first idea that came to mind was Selenium. I had a similar problem a few weeks ago with an AJAX script which loaded images on the fly. It would be interesting to understand exactly how the images on this site are delivered – aberna Nov 27 '14 at 17:31
  • I think I narrowed the problem down to needing to add the referer to the headers in the image download middleware. At least, that is how I can replicate the 403 error in the browser: if the request includes a referer there is no 403 error; without a referer I get a 403 error (a quick urllib sketch of that check is below). – dfriestedt Nov 27 '14 at 18:11
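A minimal urllib sketch of that with/without-Referer check, using the image URL from the question (the expected status codes follow the behaviour described in the comment above):

import urllib.error
import urllib.request

url = ('http://www.domu.com/sites/default/files/styles/gallery/public/'
       'filefield/field_img/20141117_133559.jpg?itok=pDSP-06i')

for headers in ({}, {'Referer': 'http://www.domu.com'}):
    req = urllib.request.Request(url, headers=headers)
    try:
        # with the Referer this should come back 200
        print(headers, urllib.request.urlopen(req).getcode())
    except urllib.error.HTTPError as e:
        # without the Referer the server answers 403
        print(headers, e.code)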
0

I was able to solve this problem by adding the referer to the header.

I used this post to help: How to add Headers to Scrapy CrawlSpider Requests?

Here is my custom image pipeline:

from scrapy.http import Request
from scrapy.contrib.pipeline.images import ImagesPipeline  # scrapy.pipelines.images in newer Scrapy versions

class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [Request(x, headers={'referer': 'http://www.domu.com'}) for x in item.get(self.IMAGES_URLS_FIELD, [])]
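For this pipeline to run, it also has to be enabled in settings.py; a minimal sketch, assuming the project module is called myproject (a placeholder name) and the item uses the default image_urls field:

# settings.py (sketch)
ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
IMAGES_STORE = '/path/to/downloaded/images'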
dfriestedt
0

Try this one:

Import these:

import scrapy
import urllib.request

and your function looks like this:

def parse(self, response):
    # extract the image url and build a local filename from it
    imageurl = response.xpath("//img/@src").get()
    imagename = imageurl.split("/")[-1].split(".")
    imagename = "addsomethingcustom" + imagename[0] + "." + imagename[-1]
    req = urllib.request.Request(imageurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'})
    resource = urllib.request.urlopen(req)
    output = open("foldername/" + imagename, "wb")
    output.write(resource.read())
    output.close()
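Given the finding earlier in the thread that this site returns 403 for image requests without a Referer header, you will probably also need to send one alongside the User-Agent; a sketch of the headers dict (the Referer value is the site's base URL from the question, and the UA string is abbreviated here):

req = urllib.request.Request(imageurl, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',  # any realistic UA string
    'Referer': 'http://www.domu.com',
})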