
I am trying to make a web crawler in Groovy, and I want to extract the resource types referenced by a webpage. I need to check whether a particular webpage links to any of the following resource types:

- PDFs
- JMP files
- SWF files
- ZIP files
- MP3 files
- Images
- Movie files
- JSL files

I am working with crawler4j for crawling and JSoup for parsing. More generally, I would like an approach that can identify any resource type I may need in the future. I tried the following in my BasicCrawler.groovy, but it only reports the content type of the page itself, i.e. text/html or text/xml, whereas I need all the resource types present on that page. Please correct me where I am going wrong:

@Override
void visit(Page page) {
    println "inside visit"
    int docid = page.getWebURL().getDocid()
    url =  page.getWebURL().getURL()
    String domain = page.getWebURL().getDomain()
    String path = page.getWebURL().getPath()
    String subDomain = page.getWebURL().getSubDomain()
    parentUrl = page.getWebURL().getParentUrl()
    String anchor = page.getWebURL().getAnchor()
    println("Docid: ${docid} ")
    println("URL: ${url}  ")
    Document doc = Jsoup.connect(url).get();
    Elements nextLinks = doc.body().select("[href]");
    for( Element link : nextLinks ) {
        String contentType = new URL(link.attr("href")).openConnection().getContentType();
        println url + "***" + contentType
    }
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData()
        String text = htmlParseData.getText()
        String html = htmlParseData.getHtml()
        List<WebURL> links = htmlParseData.getOutgoingUrls()

    }
    println("FINISHED CRAWLING")
    def crawlObj = new Resource(url : url)
    if (!crawlObj.save(flush: true, failOnError: true)) {
        crawlObj.errors.each { println it }
    }
}

After printing two doc ids, it throws the error: ERROR crawler.WebCrawler - Exception while running the visit method. Message: 'unknown protocol: tel' at java.net.URL.<init>(URL.java:592)
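The `unknown protocol: tel` error comes from `tel:` (click-to-call) links on the page: `java.net.URL` can only open connections for protocols it has handlers for (http, https, ftp, ...). A minimal sketch of a guard that skips anything that is not http(s) before opening a connection (the class and method names here are my own, not part of crawler4j or JSoup):

```java
import java.net.URI;

public class LinkGuard {
    // Returns true only for absolute http/https links that are safe to probe
    // with URL.openConnection(); tel:, mailto:, javascript: etc. are skipped.
    static boolean isProbeable(String href) {
        try {
            String scheme = new URI(href).getScheme();
            return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        } catch (Exception e) {
            return false; // malformed href: skip it rather than crash the crawl
        }
    }
}
```

Note that a relative href has no scheme and is rejected by this check, so you would first resolve links against the page URL with JSoup's `link.attr("abs:href")` and then filter.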

clever_bassi

2 Answers


You could check all URLs in the Document and ask the server for the content type. Here is a quick-and-dirty example:

Document doc = Jsoup.connect("http://yourpage").get();
Elements elements = doc.body().select("[href]");
for (Element element : elements) {
    String contentType = new URL(element.attr("href")).openConnection().getContentType();
}

For images, embedded elements and so on, you should search for the `src` attribute instead.
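Once the Content-Type header comes back, mapping it onto the categories from the question is mostly a lookup. A sketch along those lines (the mapping table and method name are illustrative, not part of any library):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceTyper {
    private static final Map<String, String> EXACT = new LinkedHashMap<>();
    static {
        EXACT.put("application/pdf", "PDF");
        EXACT.put("application/zip", "ZIP");
        EXACT.put("audio/mpeg", "MP3");
        EXACT.put("application/x-shockwave-flash", "SWF");
    }

    // Normalises a Content-Type header (dropping any "; charset=..." suffix)
    // and maps it to one of the resource categories from the question.
    static String categorize(String contentType) {
        if (contentType == null) return "unknown";
        String ct = contentType.split(";")[0].trim().toLowerCase();
        if (EXACT.containsKey(ct)) return EXACT.get(ct);
        if (ct.startsWith("image/")) return "Image";  // image/png, image/jpeg, ...
        if (ct.startsWith("video/")) return "Movie";  // video/mp4, video/quicktime, ...
        return "other: " + ct;
    }
}
```

Prefix matching on `image/` and `video/` covers the "Images" and "Movie Files" buckets without enumerating every subtype.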

lefloh
  • I will accept it as soon as I try. It looks like exactly what I might need. Thanks. – clever_bassi Jun 24 '14 at 13:32
  • Content Type will just tell about text/html I think. I need to know the various resources that are mentioned on the webpage. – clever_bassi Jun 24 '14 at 15:59
  • 1
    If the server does not lie to you, it will return the Content-Type for the requested document, for instance `application/pdf` for a PDF file or `application/zip` for a ZIP file. Be sure to try it on a page which does not only link to other HTML pages. – lefloh Jun 24 '14 at 16:01
  • You mean that content type would enlist all the different types of content(like mentioned above) for a webpage? Thanks. I didn't know this. I will try. :) – clever_bassi Jun 24 '14 at 16:03
  • Like I had thought, it just told me about text/html and text/xml. It did not tell about all the resource types in a page – clever_bassi Jul 02 '14 at 18:51
  • I didn't get you. Could you please elaborate? – clever_bassi Jul 02 '14 at 19:35
  • It would be easier to help you if you add the url you are parsing to your example. – lefloh Jul 02 '14 at 19:41
  • Ok I updated the question. Basically I have integrated crawler4j with jsoup – clever_bassi Jul 02 '14 at 19:54
  • My example code is working for this url if you search for the `src` attribute. For the `href` attribute you are always getting `text/html` because all links are pointing to other html sites. I can't find a link to a pdf or any other file on this page. – lefloh Jul 02 '14 at 19:56
  • For me, after printing two doc ids, it throws the error: ERROR crawler.WebCrawler - Exception while running the visit method. Message: 'unknown protocol: tel' at java.net.URL.(URL.java:592) – clever_bassi Jul 02 '14 at 19:57
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/56669/discussion-between-lefloh-and-ayushi). – lefloh Jul 02 '14 at 19:58

Apache Tika covers a lot of those formats:

http://tika.apache.org

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

And for those it doesn't cover, you should be able to write a recogniser.
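For niche formats like the JMP and JSL files from the question, which a generic detector may not know, such a recogniser could be as simple as matching the file extension. A purely illustrative sketch (class name, labels, and URLs are my own):

```java
public class ExtensionRecogniser {
    // Extension-based fallback for formats a generic detector misses,
    // e.g. the JMP/JSL files from the question.
    static String recognise(String url) {
        String path = url.split("[?#]")[0];                      // strip query string and fragment
        String name = path.substring(path.lastIndexOf('/') + 1); // last path segment
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) return "unknown";
        switch (name.substring(dot + 1).toLowerCase()) {
            case "jmp": return "JMP data table";
            case "jsl": return "JSL script";
            default:    return "unknown";
        }
    }
}
```

This is only a heuristic (extensions can lie), so it works best as a fallback after content-type detection.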

tim_yates