
I'm trying to learn Scrapy with Python. I've used this website, which might be a bit out of date, but I've managed to get the links and URLs as intended. Almost.

    import scrapy, time
    import random
    #from random import randint
    #from time import sleep
     
    USER_AGENT = "Mozilla/5.0"
     
    class OscarsSpider(scrapy.Spider):
       name = "oscars5"
       allowed_domains = ["en.wikipedia.org"]
       start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]
     
       def parse(self, response):
           for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"): #).extract(): Once you extract it, it becomes a string so the library can no longer process it - so dont extarct it/ - https://stackoverflow.com/questions/57417774/attributeerror-str-object-has-no-attribute-xpath
                url = response.urljoin(href)
                print(url)
                time.sleep(random.random()) #time.sleep(0.1) #### https://stackoverflow.com/questions/4054254/how-to-add-random-delays-between-the-queries-sent-to-google-to-avoid-getting-blo #### https://stackoverflow.com/questions/30030659/in-python-what-is-the-difference-between-random-uniform-and-random-random
                req = scrapy.Request(url, callback=self.parse_titles)
                time.sleep(random.random()) #sleep(randint(10,100))
                ##req.meta['proxy'] = "http://yourproxy.com:178" #https://checkerproxy.net/archive/2021-03-10 (from ; https://stackoverflow.com/questions/30330034/scrapy-error-error-downloading-could-not-open-connect-tunnel)
                yield req
     
       def parse_titles(self, response):
           for sel in response.css('html').extract():
               data = {}
               data['title'] = response.css(r"h1[id='firstHeading'] i::text").extract()
               data['director'] = response.css(r"tr:contains('Directed by') a[href*='/wiki/']::text").extract()
               data['starring'] = response.css(r"tr:contains('Starring') a[href*='/wiki/']::text").extract()
               data['releasedate'] = response.css(r"tr:contains('Release date') li::text").extract()
               data['runtime'] = response.css(r"tr:contains('Running time') td::text").extract()
     
           yield data

The problem I have is that the scraper retrieves only the first character of each href link, which I can't wrap my head around. I can't understand why, or how to fix it.

Snippet of output when I run the spider in CMD:

    2021-03-12 20:09:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners> (referer: None)
    https://en.wikipedia.org/wiki/t
    https://en.wikipedia.org/wiki/r
    https://en.wikipedia.org/wiki/[
    2021-03-12 20:09:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/T> from <GET https://en.wikipedia.org/wiki/t>
    https://en.wikipedia.org/wiki/s
    https://en.wikipedia.org/wiki/t
    2021-03-12 20:10:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://en.wikipedia.org/wiki/t> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
    2021-03-12 20:10:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/T> (referer: https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners)
    https://en.wikipedia.org/wiki/y

It does crawl, and it does find those links (or the starting points of the links we need), I'm sure, and it appends them, but it doesn't get the entire title or link. Hence I'm only scraping incorrect, non-existent pages! The output files are perfectly formatted, but with no data apart from empty strings.

https://en.wikipedia.org/wiki/t in the spider output above for example should be https://en.wikipedia.org/wiki/The_Artist_(film)

and

https://en.wikipedia.org/wiki/r should be https://en.wikipedia.org/wiki/rain_man(film)

etc.

in scrapy shell,

response.css("h1[id='firstHeading'] i::text").extract() 

returns []

confirming my fears. It's the selector.

How can I fix it?

It's not working as it should, or as it was claimed to. If anyone could help I would be very grateful.

David Wooley - AST

1 Answer

for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"):

This is just doing for x in "abcde", which iterates over each letter in the string, which is why you get t, r, [, s, ...
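You can reproduce the same behaviour in plain Python, with no Scrapy involved:

```python
# The for-loop iterates over the *string itself*, one character at a
# time -- which is exactly what produced the /wiki/t, /wiki/r, /wiki/[
# requests in the log above.
selector = "tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"
for href in selector[:6]:
    print(href)  # prints t, r, [, s, t, y -- one character per line
```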

Is this really what you intended? The parentheses sort of suggest that you intended this to be a function call. As a plain string, it makes no sense.

John Gordon