For learning purposes, I've been trying to recursively crawl and scrape all URLs on https://triniate.com/images/, but it seems that Scrapy only wants to crawl and scrape TXT, HTML, and PHP URLs.
Here is my spider code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem

class HelloSpider(CrawlSpider):
    # Name used when running the spider from the CLI
    name = 'hello'
    # Domains the spider is allowed to crawl
    allowed_domains = ["triniate.com"]
    # URL the crawl starts from
    start_urls = ["https://triniate.com/images/"]
    # A Rule can take LinkExtractor arguments (for example, to scrape only pages whose URL contains "new"),
    # but no arguments are passed here because every page should be targeted.
    # When a downloaded page matches the Rule, the function named in callback is called.
    # With follow=True, the crawl continues recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape
        # (can be given as an XPath or a CSS selector)
        item['title'] = "idc"
        return item
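For clarity, the LinkExtractor argument mentioned in the comment would look something like the sketch below. The allow pattern "new" is only the hypothetical example from the comment, not something my actual spider uses; my spider passes no arguments so that every link is followed.

    # Hypothetical illustration only: a Rule restricted to URLs containing "new".
    rules = [
        Rule(LinkExtractor(allow=r'new'), callback='parse_pageinfo', follow=True),
    ]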
items.py contains
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.item import Item, Field

class PageInfoItem(Item):
    URL = Field()
    title = Field()
and the console output is
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-21 22:30:50 [scrapy.extensions.feedexport] INFO: Stored json feed (175 items) in: haxx.json
2022-04-21 22:30:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 59541,
'downloader/request_count': 176,
'downloader/request_method_count/GET': 176,
'downloader/response_bytes': 227394,
'downloader/response_count': 176,
'downloader/response_status_count/200': 176,
'dupefilter/filtered': 875,
'elapsed_time_seconds': 8.711563,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 22, 3, 30, 50, 142416),
'httpcompression/response_bytes': 402654,
'httpcompression/response_count': 175,
'item_scraped_count': 175,
'log_count/DEBUG': 357,
'log_count/INFO': 11,
'request_depth_max': 5,
'response_received_count': 176,
'scheduler/dequeued': 176,
'scheduler/dequeued/memory': 176,
'scheduler/enqueued': 176,
'scheduler/enqueued/memory': 176,
'start_time': datetime.datetime(2022, 4, 22, 3, 30, 41, 430853)}
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Spider closed (finished)
Could someone please suggest how I should change my code to get my desired results?
EDIT: To clarify, I am trying to fetch the URL, not the image or file itself.
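To illustrate, an entry I'd expect to see in haxx.json would look roughly like this (the filename is invented, just to show the shape I'm after):

    {"URL": "https://triniate.com/images/example_photo.jpg", "title": "idc"}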