
I have a list of URLs and many of them are invalid. When I use Scrapy to crawl them, the engine automatically filters out the URLs that return a 404 status code, but some invalid URLs don't return 404 and still get crawled. When I open those pages, they say something like "there's nothing here" or "the domain has been changed", etc. Can someone let me know how to filter out these types of invalid URLs?

jason
  • `if requests.post(url).status_code !=200: print(error) ` – sahasrara62 Jul 26 '19 at 08:37
  • Hi, this is not what I am asking. Scrapy already filters out the URLs whose status code is not 200. What I am asking is how to check for the other types of invalid URLs: the ones that can actually be opened but show nothing or display an error message. Thank you :) – jason Jul 26 '19 at 08:42
  • Duplicate: https://stackoverflow.com/questions/15865611/checking-a-url-for-a-404-error-scrapy – PySaad Jul 26 '19 at 08:49

3 Answers


I already did a project where the code looked like this.

In your parse function:

def parse(self, response):
    # Only handle responses that actually came back with HTTP 200
    if response.status == 200:
        # do what you want with the valid response here
        pass
Chami Mohammed

import requests

for url in list_data:
    if requests.get(url).status_code != 200:
        print(f"invalid url: {url}")
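Note that for unreachable domains (DNS failures, connection errors), requests.get() raises an exception instead of returning a status code, so it may be worth wrapping the check. A minimal sketch of that variant (the timeout value and the printed message are assumptions; list_data is assumed to hold the candidate URLs):

import requests

valid_urls = []
for url in list_data:
    try:
        # Unreachable domains usually raise an exception
        # rather than returning an HTTP status code.
        if requests.get(url, timeout=10).status_code == 200:
            valid_urls.append(url)
    except requests.RequestException:
        print(f"unreachable url: {url}")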
harry

In your callback (e.g. parse), implement checks that detect those 200 responses that are not actually valid, and exit the callback right away (return) as soon as you detect one of those responses.
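For example, a minimal sketch of such a check, assuming the invalid pages contain recognizable phrases like the ones you mentioned (the exact markers depend on the sites being crawled):

def parse(self, response):
    # Hypothetical markers of "soft" error pages; adjust them to the actual sites.
    invalid_markers = ["there's nothing here", "the domain has been changed"]
    body = response.text.lower()
    if any(marker in body for marker in invalid_markers):
        return  # invalid page, skip it
    # ... process the valid response here ...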

Gallaecio