I use the Scrapy framework to crawl data. My crawler is interrupted when it encounters a 500 error, so I need to check whether a link is available before I parse the page content.
Is there any approach to resolve my problem?
Thank you so much.
Asked by Thinh Phan
- what does "crawler will be interrupted" mean? the process terminates? do you have some debug output from the log to show us? what does "check an available link" mean? do you want to issue a HEAD request before the GET? – Steven Almeroth Aug 30 '12 at 17:15
- what does "crawler will be interrupted" mean? the process terminates? -> yes. what does "check an available link" mean? -> I would like to check only the links which return response 200. – Thinh Phan Aug 31 '12 at 04:34
- is there an error? can you show us a stacktrace? a single 500 response will not terminate the process by default, maybe you can show us the debug log output. – Steven Almeroth Aug 31 '12 at 17:18
- do you have the HttpErrorMiddleware enabled, or are you adjusting handle_httpstatus_list in your spider? The default behaviour is for Scrapy to not process 500 responses. – Steven Almeroth Aug 31 '12 at 17:22
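
Following up on the comments: Scrapy's HttpErrorMiddleware normally filters out non-2xx responses, so a single 500 should not kill the crawl. If you want to inspect those responses yourself and only parse pages that returned 200, one option is the handle_httpstatus_list spider attribute. A minimal sketch (the spider name and start URL below are just placeholders):

import scrapy

class StatusCheckSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, for illustration only.
    name = 'status_check'
    start_urls = ['http://www.example.com/']

    # By default HttpErrorMiddleware drops non-2xx responses before they
    # reach the callback; listing 500 here lets parse() see them.
    handle_httpstatus_list = [500]

    def parse(self, response):
        if response.status == 500:
            # Skip broken pages instead of letting them interrupt the crawl.
            self.logger.warning('Got 500 from %s, skipping', response.url)
            return
        # Only responses that were not filtered out get here.
        yield {'url': response.url, 'title': response.css('title::text').get()}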
1 Answer
If the URL exists, you could use the getcode() method of urllib to check its status:

import urllib
import sys

webFile = urllib.urlopen('http://www.some.url/some/file')
returnCode = webFile.getcode()  # HTTP status code of the response
if returnCode == 500:
    sys.exit()
# otherwise, go on and process the page
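
Note that this snippet targets Python 2, where urllib.urlopen() returns a response object even for error statuses. Under Python 3 the function lives in urllib.request and raises HTTPError for 4xx/5xx responses, so an equivalent check (a sketch, using the same placeholder URL) would look roughly like this:

import sys
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen('http://www.some.url/some/file') as web_file:
        return_code = web_file.getcode()
except urllib.error.HTTPError as err:
    # urlopen() raises HTTPError for 4xx/5xx responses in Python 3.
    return_code = err.code

if return_code == 500:
    sys.exit()
# otherwise, continue processing the page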

– rebeco