I am trying to build a web crawler in Groovy that extracts the types of resources a webpage links to. I need to check whether a particular webpage contains any of the following resource types:
PDFs
JMP Files
SWF Files
ZIP Files
MP3 Files
Images
Movie Files
JSL Files
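For reference, the mapping I have in mind from file extension to these categories could be sketched as follows (the extension lists are my own assumption; JMP and JSL are SAS JMP data and script files):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceTypes {
    // Hypothetical extension -> category map covering the list above.
    static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put("pdf", "PDF");
        BY_EXTENSION.put("jmp", "JMP File");
        BY_EXTENSION.put("swf", "SWF File");
        BY_EXTENSION.put("zip", "ZIP File");
        BY_EXTENSION.put("mp3", "MP3 File");
        BY_EXTENSION.put("jpg", "Image");
        BY_EXTENSION.put("png", "Image");
        BY_EXTENSION.put("gif", "Image");
        BY_EXTENSION.put("mp4", "Movie File");
        BY_EXTENSION.put("avi", "Movie File");
        BY_EXTENSION.put("jsl", "JSL File");
    }

    // Categorize a URL by the extension after its last dot; anything
    // unrecognized falls through to "Other".
    static String categorize(String url) {
        int dot = url.lastIndexOf('.');
        if (dot < 0) return "Other";
        String ext = url.substring(dot + 1).toLowerCase();
        return BY_EXTENSION.getOrDefault(ext, "Other");
    }

    public static void main(String[] args) {
        System.out.println(categorize("http://example.com/report.pdf")); // PDF
        System.out.println(categorize("http://example.com/clip.mp4"));   // Movie File
    }
}
```

An extension map alone is not reliable (URLs without extensions, query strings), which is why I also probe the Content-Type header below.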
I am working with crawler4j for crawling and JSoup for parsing. More generally, I would like an approach that can detect any resource type I might need in the future. I tried the following in my BasicCrawler.groovy, but it only reports the content type of the page itself, i.e. text/html or text/xml, rather than the types of all the resources on that page. Please point out where I am going wrong:
@Override
void visit(Page page) {
    println "inside visit"
    int docid = page.getWebURL().getDocid()
    url = page.getWebURL().getURL()
    String domain = page.getWebURL().getDomain()
    String path = page.getWebURL().getPath()
    String subDomain = page.getWebURL().getSubDomain()
    parentUrl = page.getWebURL().getParentUrl()
    String anchor = page.getWebURL().getAnchor()
    println("Docid: ${docid} ")
    println("URL: ${url} ")

    // Probe the content type of every link on the page
    Document doc = Jsoup.connect(url).get()
    Elements nextLinks = doc.body().select("[href]")
    for (Element link : nextLinks) {
        // Assumes every href is an absolute HTTP URL
        String contentType = new URL(link.attr("href")).openConnection().getContentType()
        println url + "***" + contentType
    }

    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData()
        String text = htmlParseData.getText()
        String html = htmlParseData.getHtml()
        List<WebURL> links = htmlParseData.getOutgoingUrls()
    }

    println("FINISHED CRAWLING")
    def crawlObj = new Resource(url: url)
    if (!crawlObj.save(flush: true, failOnError: true)) {
        crawlObj.errors.each { println it }
    }
}
After printing two doc IDs, it throws this error:

    ERROR crawler.WebCrawler - Exception while running the visit method. Message: 'unknown protocol: tel' at java.net.URL.<init>(URL.java:592)
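My understanding of the error: the page contains hrefs like `tel:+1...` (and possibly `mailto:` or relative paths), and `new URL(...)` throws for any scheme the JDK has no handler for. One way around it, I think, is to resolve relative links first (JSoup supports `link.attr("abs:href")` for this) and skip any scheme other than http/https before opening a connection. A minimal sketch of that filter, with a hypothetical helper name:

```java
import java.net.URI;

public class SchemeFilter {
    // Hypothetical helper: true only for links we can probe over HTTP(S).
    static boolean isProbeable(String href) {
        try {
            String scheme = new URI(href).getScheme();
            // Relative links have no scheme; they must be resolved against
            // the page URL first (e.g. via JSoup's link.attr("abs:href")).
            return scheme == null || scheme.equals("http") || scheme.equals("https");
        } catch (Exception e) {
            return false; // malformed href: skip it rather than crash the crawl
        }
    }

    public static void main(String[] args) {
        System.out.println(isProbeable("tel:+1-555-0100"));           // false
        System.out.println(isProbeable("mailto:user@example.com"));   // false
        System.out.println(isProbeable("https://example.com/a.pdf")); // true
    }
}
```

In the loop above, guarding `new URL(...)` with a check like this should prevent the `unknown protocol: tel` exception; issuing a HEAD request on the resulting `HttpURLConnection` (via `setRequestMethod("HEAD")`) would also avoid downloading each resource just to read its Content-Type, though I have not benchmarked that.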