
Folks! I'm trying to get all internal URLs of an entire site for SEO purposes, and I recently discovered Scrapy to help me with this task. But my code always returns an error:

2017-10-11 10:32:00 [scrapy.core.engine] INFO: Spider opened
2017-10-11 10:32:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-11 10:32:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-11 10:32:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from <GET http://www.**test**.com/robots.txt>
2017-10-11 10:32:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None)
2017-10-11 10:32:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.**test**.com/> from <GET http://www.**test**.com>
2017-10-11 10:32:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.**test**.com/> (referer: None)
2017-10-11 10:32:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.**test**.com/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy\spiders\__init__.py", line 90, in parse
    raise NotImplementedError
NotImplementedError

I changed the original URL.

Here's the code I'm running:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["http://www.test.com"]
    start_urls = ["http://www.test.com"]

    rules = [Rule (LinkExtractor(allow=['.*']))]

Thanks!

EDIT:

This worked for me:

rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)

def parse_item(self, response):
    # append each crawled URL to file.txt
    filename = response.url
    arquivo = open("file.txt", "a")
    string = str(filename)
    arquivo.write(string + '\n')
    arquivo.close()
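As a side note, the same callback can also be written with a with block, which closes the file automatically even if the write fails:

def parse_item(self, response):
    # the with block closes file.txt as soon as the write finishes
    with open("file.txt", "a") as arquivo:
        arquivo.write(response.url + '\n')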

=D

Jodmoreira
  • Welcome to SO! I'd suggest that you post the solution to the question as an answer. This will help future readers understand the question and the answer better. – Nisarg Shah Oct 11 '17 at 17:34

1 Answer


The error you are getting is caused by the fact that you don't have a parse method defined in your spider, which is mandatory if you base your spider on the scrapy.Spider class.
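For illustration, this is roughly the minimal shape scrapy.Spider expects (the spider name and URL below are just placeholders):

import scrapy


class MinimalSpider(scrapy.Spider):
    name = "minimal"
    start_urls = ["http://www.test.com"]

    # parse is the default callback for every start URL;
    # omitting it is what raises the NotImplementedError seen in the log
    def parse(self, response):
        yield {'url': response.url}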

For your purpose (i.e. crawling the whole website), it's best to base your spider on the scrapy.CrawlSpider class. Also, in the Rule you have to define the callback attribute as the method that will parse every page you visit. One last cosmetic change: in LinkExtractor, if you want to visit every page, you can leave out allow, as its default value is an empty tuple, which means it will match all links found.

Consult a CrawlSpider example for concrete code.
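A rough sketch of such a spider, adapted from the code in the question (test.com stands in for the real domain, and parse_item simply records each visited URL):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TestSpider(CrawlSpider):
    name = "test"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["test.com"]
    start_urls = ["http://www.test.com"]

    # no allow pattern: the default LinkExtractor matches every link found
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # yield each visited URL as an item; export it with
        # e.g. `scrapy crawl test -o urls.csv`
        yield {'url': response.url}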

Tomáš Linhart