  1. I'm trying to scrape data, including images, from a website (IMDb) using the Scrapy package.

  2. If there is an image URL in the poster div, I can crawl the data along with the movie poster. If there isn't, my code doesn't work properly: it skips some of the data associated with the image.

  3. I want to fix it so that when there is no image URL, it ignores the image and just crawls the rest of the data.

  4. How can I fix the except part?

def parse(self, response):
    # some other lines

    try:
        poster_image_url = response.xpath('//div[@class="poster"]/a/img/@src').extract()[0]
        poster_image_url = [poster_image_url.split("_V1_")[0] + "_V1_.jpg"]
    except:
        poster_image_url = None

    item['image_urls'] = poster_image_url

This is the pipeline code:

class ImdbPipeline(object):

    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

1 Answer


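You can use extract_first() with an if check. Unlike extract()[0], it returns None instead of raising an IndexError when nothing matches:

poster_image_url = response.xpath('//div[@class="poster"]/a/img/@src').extract_first()
if poster_image_url:
    # image_urls must be a list, because the pipeline iterates over it
    item['image_urls'] = [poster_image_url.split('_V1')[0] + '_V1_.jpg']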

Alternatively, you can use Scrapy's ItemLoader.
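To make the "no poster" case explicit on both sides, here is a minimal sketch in plain Python that is not part of the original answer. The helper names build_image_urls and iter_image_urls are hypothetical, and the sketch assumes the item behaves like a dict with a get() method. The idea is to always store a list (empty when there is no poster) and to have the pipeline tolerate a missing or None field.

```python
def build_image_urls(poster_src):
    """Return a list for item['image_urls']; an empty list means 'no poster'."""
    if not poster_src:
        # extract_first() gives None when the XPath matches nothing.
        return []
    # Same string handling as in the question: keep everything before
    # "_V1_" and append "_V1_.jpg".
    return [poster_src.split("_V1_")[0] + "_V1_.jpg"]


def iter_image_urls(item):
    """Yield image URLs from a dict-like item, tolerating a missing or None field."""
    for image_url in item.get("image_urls") or []:
        yield image_url


if __name__ == "__main__":
    # Illustrative URL only; example.com stands in for a real poster address.
    print(build_image_urls("https://example.com/abc._V1_UX182.jpg"))
    print(build_image_urls(None))
    print(list(iter_image_urls({"image_urls": None})))
```

In the spider this would be used as item['image_urls'] = build_image_urls(response.xpath(...).extract_first()), and get_media_requests could loop over iter_image_urls(item) instead of item['image_urls'], so an item without a poster yields no image requests but is still crawled.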

Granitosaurus