
I have a list of URLs and many of them are invalid. When I use Scrapy to crawl them, the engine automatically filters out the URLs that return a 404 status code, but some invalid URLs don't return 404 and still get crawled. When I open those pages, they say something like "there's nothing here" or "the domain has been changed", etc. Can someone let me know how to filter out these types of invalid URLs?

jason
  • `if requests.post(url).status_code !=200: print(error) ` – sahasrara62 Jul 26 '19 at 08:37
  • Hi, this is not what I am asking. Scrapy already filters out the URLs whose status code is not 200. What I am asking is how to check for the other types of invalid URLs: the ones that can actually be opened but show nothing or display an error message. Thank you :) – jason Jul 26 '19 at 08:42
  • Duplicate: https://stackoverflow.com/questions/15865611/checking-a-url-for-a-404-error-scrapy – PySaad Jul 26 '19 at 08:49

3 Answers


I already did a project where the code looked like this.

In your parse function:

def parse(self, response):
    # Only handle responses that actually came back with HTTP 200
    if response.status == 200:
        # do what you want with the valid response here
        pass
Chami Mohammed

import requests

for url in list_data:
    if requests.get(url).status_code != 200:
        print(f"invalid url: {url}")
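Note that for unreachable domains (DNS failures, connection errors), requests.get() raises an exception instead of returning a status code, so it may be worth wrapping the check. A minimal sketch of that variant (the timeout value and the printed message are assumptions; list_data is assumed to hold the candidate URLs):

import requests

valid_urls = []
for url in list_data:
    try:
        # Unreachable domains usually raise an exception
        # rather than returning an HTTP status code.
        if requests.get(url, timeout=10).status_code == 200:
            valid_urls.append(url)
    except requests.RequestException:
        print(f"unreachable url: {url}")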
harry

In your callback (e.g. parse), implement checks that detect those 200 responses that are not actually valid, and exit the callback right away (return) as soon as you detect one of those responses.
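For example, a minimal sketch of such a check, assuming the invalid pages contain recognizable phrases like the ones you mentioned (the exact markers depend on the sites being crawled):

def parse(self, response):
    # Hypothetical markers of "soft" error pages; adjust them to the actual sites.
    invalid_markers = ["there's nothing here", "the domain has been changed"]
    body = response.text.lower()
    if any(marker in body for marker in invalid_markers):
        return  # invalid page, skip it
    # ... process the valid response here ...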

Gallaecio