When running Scrapy from an own script that loads URLs from a DB and follows all internal links on those websites, I encounter a pitty. I need to know which start_url is currently used as I have to maintain consistency with a database (SQL DB). But: When Scrapy uses the built-in list called 'start_urls' in order to receive a list of links to follow and those websites have an immediate redirect, a problem occurs. For example, when Scrapy starts and the start_urls are being crawled and the crawler follows all internal links that are being found there, I later can only determine the currently visited URL, not the start_url where Scrapy started out.
Other answers from the web are wrong, for other use cases or deprecated as there seems to have been a change in Scrapy's code last year.
MWE:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
class CustomerSpider(CrawlSpider):
name = "my_crawler"
rules = [Rule(LinkExtractor(unique=True), callback="parse_obj", ), ]
def parse_obj(self, response):
print(response.url) # find current start_url and do something
a = CustomerSpider
a.start_urls = ["https://upb.de", "https://spiegel.de"] # I want to re-identify upb.de in the crawling process in process.crawl(a), but it is redirected immediately # I have to hand over the start_urls this way, as I use the class CustomerSpider in another class
a.allowed_domains = ["upb.de", "spiegel.de"]
process = CrawlerProcess()
process.crawl(a)
process.start()
Here, I provide an MWE where Scrapy (my crawler) receives a list of URLs like I have to do it. An example redirection-url is https://upb.de which redirects to https://uni-paderborn.de.
I am searching for an elegant way of handling this as I want to make use of Scrapy's numerous features such as parallel crawling etc. Thus, I do not want to use something like the requests-library additionally. I want to find the Scrapy start_url which is currently used internally (in the Scrapy library). I appreciate your help.