
I have this simple code:

import scrapy
import re
import json
# from scrapy.http import FormRequest
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SpiderRecipe(CrawlSpider):
    name = "recipe"
    start_urls = [
        # 'https://www.giallozafferano.it/',
        'https://ricetta.it/dolci?page=1',
        # 'https://www.buonissimo.it/',
        # 'https://migusto.migros.ch/it.html'
    ]

    def parse(self,response):
        URL = response.request.url()
        if URL.split('/')[2] == "www.ricetta.it":

        recipes = response.xpath('//div[contains(@class,"row")]/div[contains(@class,"post-img-left")]').extract()
        # iterate through each recipe in a page
        for x in recipes.extract():
            title = response.xpath(recipes + '/a[contains(@class, "post-title")]/text()').extract()[x]
            image = response.xpath(recipes + '/div[contains(@class,"videoContainer")]/img/@src').extract()[x]
            description = response.xpath(recipes + '/p[contains(@class,"post-excerpt")]/text()').extract()[x]
            yield {
                'Title': title,
                'Image': image,
                'Description': description,
            }
            page = int(URL.split('=')[1]) + 1;
            if (page <= 148):
                # iterate through each page of recipes
                yield scrapy.Request(URL.split('=')[0] + str(page), callback=self.parse, dont_filter=True)

I run it from the terminal with `scrapy runspider recipe.py -o output.json`.

The first part of the code works, because it picks up the starting URL, but I don't understand why the parse function is never called. Even if the code itself is not correct, I tried printing a string at the beginning of the function and nothing was printed. I looked for solutions, but my function is inside the class and the start URL is set correctly (the link is valid). Maybe it is something very simple, but I cannot find it. I also read that the function must be called explicitly, but none of the examples do that, and in addition I already request it again at the end of the code through the callback.

Ele975
  • Where are you calling `parse()`? – Patrick Nov 23 '21 at 11:34
  • In the code above, the `SpiderRecipe` class is declared but never instantiated; could you also post the part of the code where you instantiate it? – Haroldo_OK Nov 23 '21 at 11:37
  • @Haroldo_OK I run the code from the terminal using `scrapy runspider recipe.py -o output.json`, if that is what you mean. The first part works and I can print the `start_urls`, but the method is never called – Ele975 Nov 23 '21 at 11:48
  • @Patrick As I said, I don't think I have to call `parse()` myself, because it is supposed to be called automatically when the spider is run from the terminal – Ele975 Nov 23 '21 at 11:49
  • @Patrick Does this mean that I have to add, for example, a start method that yields a request with `parse` as its callback? Like the `def start_requests(self)` method in the second example at this link? https://docs.scrapy.org/en/latest/intro/tutorial.html – Ele975 Nov 23 '21 at 11:55

2 Answers


I solved the problem. I have a separate Python environment in another folder, so I first have to activate that environment, and only then can I start Scrapy from the terminal in the folder where my spider is. The class doesn't have to be instantiated and the methods don't have to be called manually, because Scrapy does that automatically.
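One way to check which environment is actually in use is a small script like the one below. It is only a rough sketch: the helper name `describe_environment` is made up for illustration, and the virtualenv test assumes a venv/virtualenv-style environment (other environment managers may not be detected this way). Run it with the same `python` that the terminal picks up, before and after activating the environment, and compare the output.

```python
import importlib.util
import sys


def describe_environment(package="scrapy"):
    """Report which interpreter is running and whether `package` can be found from it.

    `describe_environment` is a hypothetical helper, not part of Scrapy.
    """
    spec = importlib.util.find_spec(package)
    return {
        # Path of the Python executable that is running this script
        "interpreter": sys.executable,
        # In a venv/virtualenv, sys.prefix differs from sys.base_prefix
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # Whether the package is importable from this interpreter
        "package_found": spec is not None,
        # Where the package would be loaded from, if it was found
        "package_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `package_found` is `False`, or `package_location` points outside the environment you expected, the `scrapy` command in that terminal is probably not the one installed next to your spider's dependencies.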

Ele975

I faced the same problem when I started learning Scrapy, and it turned out to come from the settings.py file.

Changing
ROBOTSTXT_OBEY = True
to
ROBOTSTXT_OBEY = False

fixed it for me. I hope it helps you as well.
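For context on why this setting can matter: with `ROBOTSTXT_OBEY = True`, Scrapy fetches the site's robots.txt and drops requests that it disallows, so the callback never runs for those URLs. The sketch below illustrates that kind of check with the standard library's `urllib.robotparser`, not with Scrapy's own middleware. The robots.txt content, the `/about` path, and the helper name `is_allowed` are made up for illustration; the real file on the site may look quite different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /dolci
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://ricetta.it/dolci?page=1", "https://ricetta.it/about"):
        print(url, "allowed:", is_allowed(SAMPLE_ROBOTS_TXT, url))
```

When the spider is a standalone file run with `scrapy runspider`, there may be no project settings.py at all; in that case the same setting can be placed in the spider's `custom_settings` class attribute. Check the site's terms before disabling robots.txt handling.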