
I am running Scrapy from my own script that loads URLs from a database (SQL) and follows all internal links on those websites, and I have hit a problem. I need to know which start_url is currently in use, because I have to keep the results consistent with the database. The issue: when Scrapy works through the built-in start_urls list and follows all internal links it finds on those sites, and a site redirects immediately, I can later only determine the URL currently being visited, not the start_url Scrapy started out from.

Other answers I found on the web are wrong, cover different use cases, or are deprecated, as there seems to have been a change in Scrapy's code last year.

MWE:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class CustomerSpider(CrawlSpider):
    name = "my_crawler"
    rules = [Rule(LinkExtractor(unique=True), callback="parse_obj")]

    def parse_obj(self, response):
        print(response.url)  # find current start_url and do something

a = CustomerSpider
# I have to hand over the start_urls this way, as I use the class CustomerSpider in another class.
# I want to re-identify upb.de in the crawling process in process.crawl(a), but it is redirected immediately.
a.start_urls = ["https://upb.de", "https://spiegel.de"]
a.allowed_domains = ["upb.de", "spiegel.de"]

process = CrawlerProcess()

process.crawl(a)
process.start()

Here I provide an MWE where Scrapy (my crawler) receives a list of URLs, as I have to do it in my setup. An example of a redirecting URL is https://upb.de, which immediately redirects to https://uni-paderborn.de.

I am looking for an elegant way of handling this, since I want to make use of Scrapy's numerous features such as parallel crawling. So I do not want to add something like the requests library on top. I want to find the start_url that Scrapy is currently using internally (inside the Scrapy library). I appreciate your help.

junkmaster
  • Hi junkmaster, welcome to SO! Please update your question to include what you mean by `start_url`, as this question is currently ambiguous. The [how to ask](https://stackoverflow.com/help/how-to-ask) page may help clarify what criteria make for a good question on SO, with especial emphasis on the ["MCVE"](https://stackoverflow.com/help/mcve) section. Good luck! – mdaniel Sep 10 '18 at 21:25
  • I generally just use `response.url` for this – domigmr May 01 '23 at 11:32

1 Answer


Ideally, you would set a meta property on the original request, and reference it later in the callback. Unfortunately, CrawlSpider doesn't support passing meta through a Rule (see #929).

You're better off building your own spider instead of subclassing CrawlSpider. Start by passing your start URLs as a parameter to process.crawl, which makes them available as an attribute on the spider instance. Within the start_requests method, yield a new Request for each URL, including the database key as a meta value.

When parse receives the response from loading your URL, run a LinkExtractor on it and yield a new request for each extracted link so that each page is scraped individually. Here you can again pass meta, propagating your original database key down the chain.

The code looks like this:

from scrapy.spiders import Spider
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess


class CustomerSpider(Spider):
    name = 'my_crawler'

    def start_requests(self):
        for url in self.root_urls:
            yield Request(url, meta={'root_url': url})

    def parse(self, response):
        links = LinkExtractor(unique=True).extract_links(response)

        for link in links:
            yield Request(
                link.url, callback=self.process_link, meta=response.meta)

    def process_link(self, response):
        print({
            'root_url': response.meta['root_url'],
            'resolved_url': response.url
        })


a = CustomerSpider
a.allowed_domains = ['upb.de', 'spiegel.de']

process = CrawlerProcess()

process.crawl(a, root_urls=['https://upb.de', 'https://spiegel.de'])
process.start()

# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/video/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/netzwelt/netzpolitik/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/thema/buchrezensionen/'}
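A side note on the redirect itself: Scrapy's RedirectMiddleware carries request.meta over to the redirected request, so the root_url set for https://upb.de is still present when the response for https://uni-paderborn.de arrives. The middleware also records the pre-redirect URLs under the redirect_urls meta key, which can serve as a cross-check. A minimal sketch of such a check inside parse (not part of the code above):

    def parse(self, response):
        # 'redirect_urls' is set by RedirectMiddleware and lists the URLs that
        # were requested before the final one; fall back to response.url when
        # no redirect happened.
        first_requested = response.meta.get('redirect_urls', [response.url])[0]
        self.logger.info('root_url=%s, first_requested=%s, resolved=%s',
                         response.meta['root_url'], first_requested, response.url)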
jschnurr
  • Hey, thanks a lot! I tried to avoid scrapy.Spider, but in the end, it is the solution! However, I next ran into trouble with allowed_domains: since only upb.de was added there, Scrapy filtered all redirected websites (uni-paderborn.de) as offsite requests. Adding an allowed domain dynamically does not help, as this has to happen when the class instance is created. This answer helps with that issue: [link](https://stackoverflow.com/a/33007741/6488190) (see also the sketch below these comments, which lists the redirect target up front). – junkmaster Sep 12 '18 at 16:00
  • But still, it seems like the crawler does not follow links on the extracted websites. Any idea for this? Do I have to include the LinkExtractor in process_link as well? Edit: I think so. Unless anyone has a better idea? :D – junkmaster Sep 12 '18 at 16:06
  • Yes - `parse` is receiving the contents of `https://upb.de`, and extracting links. For the page at each link, `process_link` is receiving the contents. If you then want to find links on THAT page, you should run `LinkExtractor` again and create `Requests` to do so. – jschnurr Sep 13 '18 at 00:51
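
Putting the two comment threads together, here is a minimal sketch of how that could look. Assumptions: the redirect target uni-paderborn.de (named in the question) is listed in allowed_domains up front so the offsite filter does not drop the redirected pages, the follow_links helper is just an illustrative name, and DEPTH_LIMIT is set so the recursive crawl does not run unbounded:

from scrapy.spiders import Spider
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess


class CustomerSpider(Spider):
    name = 'my_crawler'
    # Include the known redirect targets so the offsite middleware does not
    # filter the redirected sites as offsite requests.
    allowed_domains = ['upb.de', 'uni-paderborn.de', 'spiegel.de']

    def start_requests(self):
        for url in self.root_urls:
            yield Request(url, meta={'root_url': url})

    def parse(self, response):
        yield from self.follow_links(response)

    def process_link(self, response):
        print({'root_url': response.meta['root_url'],
               'resolved_url': response.url})
        # Keep going: extract links from this page as well.
        yield from self.follow_links(response)

    def follow_links(self, response):
        # Only root_url is carried forward; Scrapy's duplicate filter takes
        # care of URLs that have already been seen.
        for link in LinkExtractor(unique=True).extract_links(response):
            yield Request(link.url, callback=self.process_link,
                          meta={'root_url': response.meta['root_url']})


process = CrawlerProcess(settings={'DEPTH_LIMIT': 2})
process.crawl(CustomerSpider, root_urls=['https://upb.de', 'https://spiegel.de'])
process.start()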