I'm trying to learn Scrapy with Python. I've used this website, which might be slightly out of date, but I've managed to get the links and URLs as intended. Almost.
import scrapy, time
import random
#from random import randint
#from time import sleep

USER_AGENT = "Mozilla/5.0"

class OscarsSpider(scrapy.Spider):
    name = "oscars5"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

    def parse(self, response):
        for href in (r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"):  # ).extract(): once you extract it, it becomes a string so the library can no longer process it - so don't extract it - https://stackoverflow.com/questions/57417774/attributeerror-str-object-has-no-attribute-xpath
            url = response.urljoin(href)
            print(url)
            time.sleep(random.random())  # time.sleep(0.1) - https://stackoverflow.com/questions/4054254/how-to-add-random-delays-between-the-queries-sent-to-google-to-avoid-getting-blo - https://stackoverflow.com/questions/30030659/in-python-what-is-the-difference-between-random-uniform-and-random-random
            req = scrapy.Request(url, callback=self.parse_titles)
            time.sleep(random.random())  # sleep(randint(10,100))
            ##req.meta['proxy'] = "http://yourproxy.com:178"  # https://checkerproxy.net/archive/2021-03-10 (from: https://stackoverflow.com/questions/30330034/scrapy-error-error-downloading-could-not-open-connect-tunnel)
            yield req

    def parse_titles(self, response):
        for sel in response.css('html').extract():
            data = {}
            data['title'] = response.css(r"h1[id='firstHeading'] i::text").extract()
            data['director'] = response.css(r"tr:contains('Directed by') a[href*='/wiki/']::text").extract()
            data['starring'] = response.css(r"tr:contains('Starring') a[href*='/wiki/']::text").extract()
            data['releasedate'] = response.css(r"tr:contains('Release date') li::text").extract()
            data['runtime'] = response.css(r"tr:contains('Running time') td::text").extract()
            yield data
The problem I have is that the scraper retrieves only the first character of each href link, which I can't wrap my head around. I can't understand why, or how to fix it.
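One theory I'm not sure about: the response.css(...) call seems to have been lost from that for line, so the loop is iterating over the raw selector string itself, and iterating over a Python string yields one character at a time:

# What I think the loop is effectively doing: walking the selector
# string itself, with no response.css(...) around it.
selector = r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"
for href in selector:
    print(href)  # prints 't', 'r', '[', 's', ... one character per pass

And here is a sketch of what I guess the loop should look like instead (untested; I believe .getall(), or the older .extract(), on a SelectorList returns the matched href values as plain strings, which response.urljoin() accepts):

    # inside OscarsSpider - run the selector against the response,
    # then loop over the extracted href strings
    def parse(self, response):
        for href in response.css(
                r"tr[style='background:#FAEB86'] a[href*='film)']::attr(href)"
        ).getall():
            url = response.urljoin(href)  # urljoin() works on a plain string
            yield scrapy.Request(url, callback=self.parse_titles)

Is that right, or am I misreading my own code?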
Snippet of output when I run the spider in CMD:
2021-03-12 20:09:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners> (referer: None)
https://en.wikipedia.org/wiki/t
https://en.wikipedia.org/wiki/r
https://en.wikipedia.org/wiki/[
2021-03-12 20:09:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/T> from <GET https://en.wikipedia.org/wiki/t>
https://en.wikipedia.org/wiki/s
https://en.wikipedia.org/wiki/t
2021-03-12 20:10:01 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://en.wikipedia.org/wiki/t> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-03-12 20:10:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/T> (referer: https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners)
https://en.wikipedia.org/wiki/y
It does crawl, and I'm sure it finds those links (or at least the starting points of the links we need) and appends them, but it doesn't get the entire title or link. Hence I'm only scraping incorrect, non-existent pages! The output files are perfectly formatted, but with no data apart from empty strings.
For example, https://en.wikipedia.org/wiki/t in the spider output above should be https://en.wikipedia.org/wiki/The_Artist_(film), https://en.wikipedia.org/wiki/r should and could be https://en.wikipedia.org/wiki/rain_man(film), and so on.
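That pattern would match how URL joining resolves a bare one-character path against the current page (as far as I know, Scrapy's response.urljoin() just delegates to the standard library):

from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Category:Best_Picture_Academy_Award_winners"
print(urljoin(base, "t"))  # -> https://en.wikipedia.org/wiki/t
print(urljoin(base, "r"))  # -> https://en.wikipedia.org/wiki/r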
In the Scrapy shell,

response.css("h1[id='firstHeading'] i::text").extract()

returns [], confirming my fears: it's the selector.
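For completeness, this is roughly how I'm checking it (one caveat I'm unsure about: if the shell is opened on the category/list page rather than an individual film page, the heading contains no <i> element, so [] there might just mean I'm testing against the wrong page):

scrapy shell "https://en.wikipedia.org/wiki/The_Artist_(film)"
>>> response.css("h1[id='firstHeading'] i::text").extract()
# I'd expect something like ['The Artist'] here if the selector is fine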
How can I fix it? It's not working as it should, or as it was claimed to. If anyone could help, I would be very grateful.