For learning purposes, I've been trying to recursively crawl and scrape all URLs on https://triniate.com/images/, but it seems that Scrapy only wants to crawl and scrape TXT, HTML, and PHP URLs.
Here is my spider code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem

class HelloSpider(CrawlSpider):
    # Name used when running the spider from the CLI
    name = 'hello'
    # Domains the spider is allowed to crawl
    allowed_domains = ["triniate.com"]
    # URL the crawl starts from
    start_urls = ["https://triniate.com/images/"]
    # A Rule can take LinkExtractor arguments (for example, to scrape only pages whose URL contains "new"),
    # but no arguments are passed here because every page should be targeted.
    # When a downloaded page matches the Rule, the function named in callback is called.
    # With follow=True, the crawl continues recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape
        # (can be given as an XPath or a CSS selector)
        item['title'] = "idc"
        return item
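For clarity, the LinkExtractor argument mentioned in the comment would look something like the sketch below. The allow pattern "new" is only the hypothetical example from the comment, not something my actual spider uses; my spider passes no arguments so that every link is followed.

    # Hypothetical illustration only: a Rule restricted to URLs containing "new".
    rules = [
        Rule(LinkExtractor(allow=r'new'), callback='parse_pageinfo', follow=True),
    ]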
items.py contains
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.item import Item, Field

class PageInfoItem(Item):
    URL = Field()
    title = Field()
and the console output is
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-21 22:30:50 [scrapy.extensions.feedexport] INFO: Stored json feed (175 items) in: haxx.json
2022-04-21 22:30:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 59541,
'downloader/request_count': 176,
'downloader/request_method_count/GET': 176,
'downloader/response_bytes': 227394,
'downloader/response_count': 176,
'downloader/response_status_count/200': 176,
'dupefilter/filtered': 875,
'elapsed_time_seconds': 8.711563,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 22, 3, 30, 50, 142416),
'httpcompression/response_bytes': 402654,
'httpcompression/response_count': 175,
'item_scraped_count': 175,
'log_count/DEBUG': 357,
'log_count/INFO': 11,
'request_depth_max': 5,
'response_received_count': 176,
'scheduler/dequeued': 176,
'scheduler/dequeued/memory': 176,
'scheduler/enqueued': 176,
'scheduler/enqueued/memory': 176,
'start_time': datetime.datetime(2022, 4, 22, 3, 30, 41, 430853)}
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Spider closed (finished)
Could someone please suggest how I should change my code to get my desired results?
EDIT: To clarify, I am trying to fetch the URL, not the image or file itself.
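To illustrate, an entry I'd expect to see in haxx.json would look roughly like this (the filename is invented, just to show the shape I'm after):

    {"URL": "https://triniate.com/images/example_photo.jpg", "title": "idc"}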