
I use a spider to crawl many websites from a list. It works as I need, but now I also want to get the connection status. When running the spider I see some 404s, some 301s and some DNS errors.

How can I get the connection status into my csv?

import scrapy


class CmsSpider(scrapy.Spider):
    name = 'myspider'
    f = open("random.csv")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        title = response.xpath('//title/text()').extract_first()
        url = response.request.url
        description = response.xpath('//meta[@name="description"]/@content').extract_first()

        yield {'URL': url, 'Page Title': title, 'Description': description}
deelite
    Does this answer your question? [How do I catch errors with scrapy so I can do something when I get User Timeout error?](https://stackoverflow.com/questions/31146046/how-do-i-catch-errors-with-scrapy-so-i-can-do-something-when-i-get-user-timeout) – Gallaecio Jan 08 '20 at 13:20
  • It looks like a solution for me, but I don't know how to merge it with my spider. – deelite Jan 08 '20 at 17:08
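
The linked answer boils down to attaching an errback to every request, so failures that never produce an HTTP response (DNS errors, timeouts) still reach your code. Below is a minimal sketch of merging that into the spider above; the on_error method name and the Error field are illustrative assumptions, not taken from the question or the linked answer.

import scrapy


class CmsSpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # Read the URL list here instead of at class level,
        # so that an errback can be attached to every request.
        with open("random.csv") as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        yield {
            'URL': response.request.url,
            'Page Title': response.xpath('//title/text()').extract_first(),
            'Description': response.xpath('//meta[@name="description"]/@content').extract_first(),
        }

    def on_error(self, failure):
        # Called for requests that failed without an HTTP response,
        # e.g. DNS lookup errors or timeouts. Yielding an item here
        # writes the failed URL into the same CSV feed.
        yield {
            'URL': failure.request.url,
            'Page Title': None,
            'Description': None,
            'Error': failure.getErrorMessage(),
        }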

1 Answer


Use

status = response.status
yield {'URL': url, 'Page Title': title, 'Description': description, 'Status': status}

Taken from [Checking a url for a 404 error scrapy](https://stackoverflow.com/questions/15865611/checking-a-url-for-a-404-error-scrapy).

resp.getcode() would only work if urllib2 were used instead of scrapy.http.Response, which would not be the right approach here.
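
Applied to the spider from the question, that could look like the sketch below. One caveat worth flagging: by default Scrapy's HttpError middleware drops 404 responses and the Redirect middleware follows 301s before parse() ever sees them, so the sketch sets handle_httpstatus_list to let those statuses through; the Status field name is just an example, not part of the original answer.

import scrapy


class CmsSpider(scrapy.Spider):
    name = 'myspider'
    # Let 404 and 301 responses reach parse() instead of being
    # filtered out or transparently redirected by the middlewares.
    handle_httpstatus_list = [301, 404]

    f = open("random.csv")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        yield {
            'URL': response.request.url,
            'Page Title': response.xpath('//title/text()').extract_first(),
            'Description': response.xpath('//meta[@name="description"]/@content').extract_first(),
            # HTTP status code of the response, e.g. 200, 301 or 404.
            'Status': response.status,
        }

Running it with the usual feed export, e.g. scrapy crawl myspider -o result.csv, adds the extra column to the CSV. DNS errors never produce a response at all, so for those the errback approach linked in the comments above is still needed.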

Thomas Strub
  • Ok ... should have used the 2nd example from https://stackoverflow.com/questions/15865611/checking-a-url-for-a-404-error-scrapy – Thomas Strub Jan 08 '20 at 14:02