I have a spider that scrapes data from a web page and writes the title, text and image URL to MongoDB.

I have two functions:

def parse_news(self, response):
    item = NewsItem()
    item['_id'] = .. #key for MongoDB - Unique
    item['Title'] = ..
    item['URL'] = ..
    if len(..): #check if the article has a gallery
        for i in xrange(2, 5): #if yes iterate through all the images
            # request each gallery page; parse_gallery extracts the img url
            gallery_img_link = urlparse.urljoin(response.url, '%d/#gallery_photo' % i)
            yield Request(gallery_img_link, meta={'item': item}, callback=self.parse_gallery)
    yield item

def parse_gallery(self, response):
    # extract_first() returns None when nothing matches, so test truthiness instead of calling len()
    img_url = response.xpath('//*[@id="gallery_photo"]/div/img/@src').extract_first()
    if img_url: #skip pages past the end of the gallery so no empty values are stored
        item = response.meta['item']
        item['Gallery'] = img_url
        yield item

I want item['Gallery'] to hold the extracted image URLs as an array, and the item to be written to MongoDB only once the loop has finished.

In other words, I need to pass item['Gallery'] to the second function, append each image URL to it, and yield the item (so that it is written to MongoDB) only after the for loop is done.
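A rough sketch of the flow described above, under stated assumptions and not tested against the real site: instead of firing all gallery requests at once, each callback appends its image URL and then requests the next page, and the item is yielded only when the chain ends. So that it runs without Scrapy, `Request`, `Response`, `crawl` and `fake_site` are hypothetical stand-ins; in a real spider `Request` would be `scrapy.Request` and `response.img_url` would be the XPath lookup. It also assumes the article has a gallery and uses a plain dict in place of `NewsItem`.

```python
# Stand-ins so the sketch runs without Scrapy (hypothetical, for illustration only).
class Request(object):
    def __init__(self, url, meta, callback):
        self.url, self.meta, self.callback = url, meta, callback

class Response(object):
    def __init__(self, url, meta, img_url):
        self.url, self.meta, self.img_url = url, meta, img_url

LAST_PAGE = 4  # assumption: same upper bound as xrange(2, 5)

def gallery_url(base, page):
    return '%s%d/#gallery_photo' % (base, page)

def parse_news(response):
    item = {'URL': response.url, 'Gallery': []}
    # Start the chain at page 2 and carry the item along in meta.
    yield Request(gallery_url(response.url, 2),
                  meta={'item': item, 'base': response.url, 'page': 2},
                  callback=parse_gallery)

def parse_gallery(response):
    item = response.meta['item']
    page = response.meta['page']
    if response.img_url:
        item['Gallery'].append(response.img_url)
    if response.img_url and page < LAST_PAGE:
        # More pages may exist: pass the same item on to the next request.
        meta = dict(response.meta, page=page + 1)
        yield Request(gallery_url(response.meta['base'], page + 1),
                      meta=meta, callback=parse_gallery)
    else:
        yield item  # chain finished: the item is emitted exactly once

def crawl(start_response, fake_site):
    """Tiny driver imitating the scheduler: follows Requests, collects items."""
    items, pending = [], list(parse_news(start_response))
    while pending:
        out = pending.pop(0)
        if isinstance(out, Request):
            resp = Response(out.url, out.meta, fake_site.get(out.url))
            pending.extend(out.callback(resp))
        else:
            items.append(out)
    return items

base = 'http://www.website.com/news-1-title/'
fake_site = {base + '2/#gallery_photo': 'img2.jpg',
             base + '3/#gallery_photo': 'img3.jpg'}
items = crawl(Response(base, {}, None), fake_site)
# items holds a single dict whose 'Gallery' is ['img2.jpg', 'img3.jpg']
```

In a real spider a failed gallery request would never reach the callback, so an errback that yields the item would probably be needed as well.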

Why this is needed: the problem is extracting the image URLs of the galleries. A gallery page does not list all of its images; you have to click "next" to get the next image URL. Clicking "next" reloads the whole page and changes its URL like this:

http://www.website.com/news-1-title/2/#gallery_photo for the second image and /3/#gallery_photo for the third and so on.

The loop requests pages 2 through 4 (xrange(2, 5) stops before 5), checks whether each page has an image URL, and extracts it if so.
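To make the URL pattern concrete, here is a minimal sketch of how the gallery page URLs come out of urljoin. `gallery_page_urls` is a hypothetical helper, and the trailing-slash handling is an assumption: without a trailing slash on the article URL, urljoin would replace the last path segment instead of appending to it.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as in the spider above

def gallery_page_urls(article_url, first=2, last=4):
    """Hypothetical helper: URLs of gallery pages first..last, inclusive."""
    # Ensure a trailing slash so the page number is appended, not substituted.
    base = article_url if article_url.endswith('/') else article_url + '/'
    return [urljoin(base, '%d/#gallery_photo' % i)
            for i in range(first, last + 1)]

urls = gallery_page_urls('http://www.website.com/news-1-title')
# urls[0] is 'http://www.website.com/news-1-title/2/#gallery_photo'
```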

Thanks in advance.

vezunchik
endritius