I am creating a Groovy & Grails app with MongoDB as the backend. I am using crawler4j for crawling and Jsoup for parsing. I need to get the HTTP status of each URL and save it to the database. This is what I am trying:
@Override
void visit(Page page) {
    try {
        String url = page.getWebURL().getURL()
        // Fetch once and reuse the response for both the status code and the parsed document.
        // ignoreHttpErrors(true) stops Jsoup from throwing HttpStatusException on non-200 responses.
        Connection.Response response = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .ignoreHttpErrors(true)
                .execute()
        Document doc = response.parse()
        int statusCode = response.statusCode()
        println "status code is " + statusCode
        urlExists = (statusCode == 200) // urlExists is a boolean field
        // save to database
        resource = new Resource(mimeType: "text/html", URLExists: urlExists)
        if (!resource.save(flush: true, failOnError: true)) {
            resource.errors.each { println it }
        }
        // other code
    } catch (Exception e) {
        log.error "Exception is ${e.message}"
    }
}
@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {
            println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()
        } else {
            println "Non-success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
        }
    }
}
The problem is that as soon as the crawler hits a URL with an HTTP status other than 200 (OK), it goes straight to the handlePageStatusCode() method (crawler4j only calls visit() for successfully fetched pages) and prints the non-success message, but nothing gets saved to the database. Is there any way I can save to the database when the page status is not 200? If I am doing something wrong, please tell me. Thanks
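Is doing the save directly inside handlePageStatusCode() the right approach? This is a rough sketch of what I had in mind, reusing the same Resource domain class and URLExists field from above; I wrapped the save in withTransaction because the crawler runs outside a Grails request thread, but I am not sure whether that part is needed or correct:

@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        // visit() will never be called for this URL, so record it here instead.
        Resource.withTransaction {
            def resource = new Resource(mimeType: "text/html", URLExists: false)
            if (!resource.save(flush: true)) {
                resource.errors.each { println it }
            }
        }
    }
}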