
I use a spider to crawl many websites from a list. It works as I need, but now I also want to get the connection status. When running the spider I see some 404s, some 301s and some DNS errors.

How can I get the connection status into my csv?

import scrapy


class CmsSpider(scrapy.Spider):
    name = 'myspider'
    f = open("random.csv")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        title = response.xpath('//title/text()').extract_first()
        url = response.request.url
        description = response.xpath('//meta[@name="description"]/@content').extract_first()

        yield {'URL': url, 'Page Title': title, 'Description': description}
deelite
    Does this answer your question? [How do I catch errors with scrapy so I can do something when I get User Timeout error?](https://stackoverflow.com/questions/31146046/how-do-i-catch-errors-with-scrapy-so-i-can-do-something-when-i-get-user-timeout) – Gallaecio Jan 08 '20 at 13:20
  • It looks like a solution for me, but I don't know how to merge it with my spider. – deelite Jan 08 '20 at 17:08
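
The linked answer boils down to attaching an errback to every request, so failures that never produce an HTTP response (DNS errors, timeouts) still reach your code. Below is a minimal sketch of merging that into the spider above; the on_error method name and the Error field are illustrative assumptions, not taken from the question or the linked answer.

import scrapy


class CmsSpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # Read the URL list here instead of at class level,
        # so that an errback can be attached to every request.
        with open("random.csv") as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {
            'URL': response.request.url,
            'Page Title': response.xpath('//title/text()').extract_first(),
            'Description': response.xpath('//meta[@name="description"]/@content').extract_first(),
        }

    def on_error(self, failure):
        # Called for requests that failed without an HTTP response,
        # e.g. DNS lookup errors or timeouts. Yielding an item here
        # writes the failed URL into the same CSV feed.
        yield {
            'URL': failure.request.url,
            'Page Title': None,
            'Description': None,
            'Error': failure.getErrorMessage(),
        }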

1 Answer


Use

status = response.status
yield {'URL': url, 'Page Title': title, 'Description': description, 'Status': status}

Taken from [Checking a url for a 404 error scrapy](https://stackoverflow.com/questions/15865611/checking-a-url-for-a-404-error-scrapy).

resp.getcode() would only work if urllib2 were used instead of scrapy.http.Response, which would not be the right approach here.
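
Applied to the spider from the question, that could look like the sketch below. One caveat worth flagging: by default Scrapy's HttpError middleware drops 404 responses and the Redirect middleware follows 301s before parse() ever sees them, so the sketch sets handle_httpstatus_list to let those statuses through; the Status field name is just an example, not part of the original answer.

import scrapy


class CmsSpider(scrapy.Spider):
    name = 'myspider'
    # Let 404 and 301 responses reach parse() instead of being
    # filtered out or transparently redirected by the middlewares.
    handle_httpstatus_list = [301, 404]

    f = open("random.csv")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        yield {
            'URL': response.request.url,
            'Page Title': response.xpath('//title/text()').extract_first(),
            'Description': response.xpath('//meta[@name="description"]/@content').extract_first(),
            # HTTP status code of the response, e.g. 200, 301 or 404.
            'Status': response.status,
        }

Running it with the usual feed export, e.g. scrapy crawl myspider -o result.csv, adds the extra column to the CSV. DNS errors never produce a response at all, so for those the errback approach linked in the comments above is still needed.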

Thomas Strub
  • Ok ... should have used the 2nd example from https://stackoverflow.com/questions/15865611/checking-a-url-for-a-404-error-scrapy – Thomas Strub Jan 08 '20 at 14:02