Scrapy: scraping the IMDb website using XPath expressions

Question

Every field in the output comes back as None, and I cannot figure out what is wrong with the code.

I am scraping details of the top 1000 rated movies on IMDb.

Link: https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating

CODE

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating']
    
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a "), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield{
            'title' : response.xpath("//h1[@class='TitleHeader__TitleText-sc-1wu6n3d-0 cLNRlG']/text()").get(),
            'year' : response.xpath("(//li/span[@class='TitleBlockMetaData__ListItemText-sc-12ein40-2 jedhex'])[1]/text()").get(),
            'duration' : response.xpath("(//li[@class='ipc-inline-list__item'])[3]/text()").get(),
            'rating' : response.xpath("(//span[@class='AggregateRatingButton__RatingScore-sc-1il8omz-1 fhMjqK'])[2]/text()").get(),
            'director' : response.xpath("(//a[@class='ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link'])[13]/text()").get(),
            'movie_url' : response.url
        }

Are you sure the selectors are right? You can use pdb to test them: add this line at the beginning of your parse_item function and check whether each selector matches anything. `import pdb; pdb.set_trace()` – Cagatay Barin, Jun 28 '21 at 16:26
@ÇağatayBarın I checked all of them and they extract nothing, but I don't understand what the correct ones would be. – Sherlock_oms, Jun 28 '21 at 18:15
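
The check suggested above can also be done in bulk. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (it supports only a subset of XPath and needs well-formed markup), and SAMPLE_PAGE, SELECTORS and check_selectors are hypothetical names with invented markup, not real IMDb HTML.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for a downloaded movie page (not real IMDb markup).
SAMPLE_PAGE = """
<html><body>
  <h1>Example Movie <span id="titleYear">(<a href="/year/1994/">1994</a>)</span></h1>
  <time>2h 22min</time>
</body></html>
"""

# Field name -> XPath. ElementTree supports only a subset of XPath 1.0,
# so these are simplified versions of the expressions used in the spiders.
SELECTORS = {
    "title": ".//h1",
    "year": ".//span[@id='titleYear']/a",
    "duration": ".//time",
    # Exact match on a generated class name, as in the question; absent from SAMPLE_PAGE.
    "rating": ".//span[@class='AggregateRatingButton__RatingScore-sc-1il8omz-1 fhMjqK']",
}


def check_selectors(page, selectors):
    """Return {field: text or None} so that selectors matching nothing are easy to spot."""
    root = ET.fromstring(page)
    results = {}
    for field, xpath in selectors.items():
        node = root.find(xpath)
        text = node.text if node is not None else None
        results[field] = text.strip() if text else None
    return results


report = check_selectors(SAMPLE_PAGE, SELECTORS)
missing = [field for field, value in report.items() if value is None]
print(report)
print("selectors that matched nothing:", missing)
```

Inside a real spider the same idea would apply to response.xpath(...).get() for each field: any field that comes back as None points at a selector that does not match the HTML Scrapy actually downloaded.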

Md. Fazlul Hoque · Answer 1 · 2021-06-28T16:15:57.483

Almost all of the XPath expressions in your spider are incorrect, and there is no pagination rule. Here is a complete solution. It also shows how to handle pagination with a CrawlSpider.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating']
    
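    rules = (
        # Movie detail links on each listing page: parse them, but do not crawl further from them.
        Rule(LinkExtractor(restrict_xpaths="//h3[@class='lister-item-header']/a"), callback='parse_item', follow=False),
        # "Next" link: follow it to the next listing page. No callback, since listing pages are not movie items.
        Rule(LinkExtractor(restrict_xpaths='(//*[@class="lister-page-next next-page"])[1]'), follow=True),
    )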

    def parse_item(self, response):
        yield{
            'title' : response.xpath('(//h1/text())[1]').get().strip(),
            'year' : response.xpath('//span[@id="titleYear"]/a/text()').get(),
            'duration' : response.xpath('normalize-space((//time/text())[1])').get(),
            'rating' : response.xpath('//*[@itemprop="ratingValue"]/text()').get(),
            'director' : response.xpath('(//*[@class="credit_summary_item"]/h4/following-sibling::a)[1]/text()').get(),
            'movie_url' : response.url
        }
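
One caveat about the title line above: .get() returns None when the XPath matches nothing, and calling .strip() on None raises AttributeError. A minimal sketch of a guard follows; clean_text is a hypothetical helper name, not part of Scrapy.

```python
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace, passing None through unchanged."""
    return value.strip() if value is not None else None


# Plain strings stand in for the result of response.xpath(...).get().
print(clean_text("  Example Movie \n"))
print(clean_text(None))
```

In the spider this could be used as 'title': clean_text(response.xpath('(//h1/text())[1]').get()). Passing a default to .get(), for example .get(default=''), is another way to avoid the None case.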

edited Jun 28 '21 at 16:15

answered Jun 28 '21 at 15:46


First of all, thanks a lot; your code works perfectly. However, when I try the XPath expressions you gave for the year, duration, or director in the browser's inspect element panel, they match nothing. What I don't understand is how the crawler can extract the correct info while inspect element shows nothing for the same XPath. It would also be a great help if you could tell me how you figured out those expressions. Thanks – Sherlock_oms Jun 28 '21 at 18:01

The general rule of thumb for choosing elements with a CrawlSpider is this: click any item link (here, any movie link), which takes you to a separate page with its own URL, and write the selector expressions for your parse items against that page. That is the correct way to pick the right elements. You can also use scrapy shell to debug element selection; I used scrapy shell here to check whether my XPath expressions were correct. Your rule for the movie links was already correct, and that is the only way to do it. Thanks – Md. Fazlul Hoque Jun 28 '21 at 18:26
I understand all of this, and thanks a lot. What I don't understand is that when I run your XPath expressions in the inspect element panel on any movie page, they return nothing, but when used in Scrapy they scrape perfectly. I'm just confused about that. – Sherlock_oms Jun 29 '21 at 19:17
