
I am trying to make a web crawler in Groovy, and I want to extract the resource types referenced by a webpage. I need to check whether a particular webpage links to any of the following resource types:

- PDFs
- JMP files
- SWF files
- ZIP files
- MP3 files
- Images
- Movie files
- JSL files

I am working with crawler4j for crawling and JSoup for parsing. More generally, I would like an approach that can identify any resource type I may need in the future. I tried the following in my BasicCrawler.groovy, but it only reports the content type of the page itself, i.e. text/html or text/xml, whereas I need all the resource types present on that page. Please correct me where I am going wrong:

@Override
void visit(Page page) {
    println "inside visit"
    int docid = page.getWebURL().getDocid()
    url =  page.getWebURL().getURL()
    String domain = page.getWebURL().getDomain()
    String path = page.getWebURL().getPath()
    String subDomain = page.getWebURL().getSubDomain()
    parentUrl = page.getWebURL().getParentUrl()
    String anchor = page.getWebURL().getAnchor()
    println("Docid: ${docid} ")
    println("URL: ${url}  ")
    Document doc = Jsoup.connect(url).get();
    Elements nextLinks = doc.body().select("[href]");
    for( Element link : nextLinks ) {
        String contentType = new URL(link.attr("href")).openConnection().getContentType();
        println url + "***" + contentType
    }
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData()
        String text = htmlParseData.getText()
        String html = htmlParseData.getHtml()
        List<WebURL> links = htmlParseData.getOutgoingUrls()

    }
    println("FINISHED CRAWLING")
    def crawlObj = new Resource(url : url)
    if (!crawlObj.save(flush: true, failOnError: true)) {
        crawlObj.errors.each { println it }
    }
}

After printing two doc ids, it throws the error: ERROR crawler.WebCrawler - Exception while running the visit method. Message: 'unknown protocol: tel' at java.net.URL.<init>(URL.java:592)
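The `unknown protocol: tel` error comes from `tel:` (click-to-call) links on the page: `java.net.URL` can only open connections for protocols it has handlers for (http, https, ftp, ...). A minimal sketch of a guard that skips anything that is not http(s) before opening a connection (the class and method names here are my own, not part of crawler4j or JSoup):

```java
import java.net.URI;

public class LinkGuard {
    // Returns true only for absolute http/https links that are safe to probe
    // with URL.openConnection(); tel:, mailto:, javascript: etc. are skipped.
    static boolean isProbeable(String href) {
        try {
            String scheme = new URI(href).getScheme();
            return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        } catch (Exception e) {
            return false; // malformed href: skip it rather than crash the crawl
        }
    }
}
```

Note that a relative href has no scheme and is rejected by this check, so you would first resolve links against the page URL with JSoup's `link.attr("abs:href")` and then filter.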

clever_bassi

2 Answers


You could check all URLs in the Document and ask the server for the content type. Here is a quick-and-dirty example:

Document doc = Jsoup.connect("http://yourpage").get();
Elements elements = doc.body().select("[href]");
for (Element element : elements) {
    String contentType = new URL(element.attr("href")).openConnection().getContentType();
}

For images, embedded elements and so on, you should search for the `src` attribute instead.
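Once the Content-Type header comes back, mapping it onto the categories from the question is mostly a lookup. A sketch along those lines (the mapping table and method name are illustrative, not part of any library):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResourceTyper {
    private static final Map<String, String> EXACT = new LinkedHashMap<>();
    static {
        EXACT.put("application/pdf", "PDF");
        EXACT.put("application/zip", "ZIP");
        EXACT.put("audio/mpeg", "MP3");
        EXACT.put("application/x-shockwave-flash", "SWF");
    }

    // Normalises a Content-Type header (dropping any "; charset=..." suffix)
    // and maps it to one of the resource categories from the question.
    static String categorize(String contentType) {
        if (contentType == null) return "unknown";
        String ct = contentType.split(";")[0].trim().toLowerCase();
        if (EXACT.containsKey(ct)) return EXACT.get(ct);
        if (ct.startsWith("image/")) return "Image";  // image/png, image/jpeg, ...
        if (ct.startsWith("video/")) return "Movie";  // video/mp4, video/quicktime, ...
        return "other: " + ct;
    }
}
```

Prefix matching on `image/` and `video/` covers the "Images" and "Movie Files" buckets without enumerating every subtype.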

lefloh
  • I will accept it as soon as I try. It looks like exactly what I might need. Thanks. – clever_bassi Jun 24 '14 at 13:32
  • Content Type will just tell about text/html I think. I need to know the various resources that are mentioned on the webpage. – clever_bassi Jun 24 '14 at 15:59
  • 1
    If the server does not lie to you, it will return the Content-Type for the requested document, for instance `application/pdf` for a PDF file or `application/zip` for a ZIP file. Be sure to try it on a page which does not only link to other HTML pages. – lefloh Jun 24 '14 at 16:01
  • You mean that content type would enlist all the different types of content(like mentioned above) for a webpage? Thanks. I didn't know this. I will try. :) – clever_bassi Jun 24 '14 at 16:03
  • Like I had thought, it just told me about text/html and text/xml. It did not tell about all the resource types in a page – clever_bassi Jul 02 '14 at 18:51
  • I didn't get you. Could you please elaborate? – clever_bassi Jul 02 '14 at 19:35
  • It would be easier to help you if you add the url you are parsing to your example. – lefloh Jul 02 '14 at 19:41
  • Ok I updated the question. Basically I have integrated crawler4j with jsoup – clever_bassi Jul 02 '14 at 19:54
  • My example code is working for this url if you search for the `src` attribute. For the `href` attribute you are always getting `text/html` because all links are pointing to other html sites. I can't find a link to a pdf or any other file on this page. – lefloh Jul 02 '14 at 19:56
  • For me, after printing two doc ids, it throws the error: ERROR crawler.WebCrawler - Exception while running the visit method. Message: 'unknown protocol: tel' at java.net.URL.(URL.java:592) – clever_bassi Jul 02 '14 at 19:57
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/56669/discussion-between-lefloh-and-ayushi). – lefloh Jul 02 '14 at 19:58

Apache Tika covers a lot of those formats:

http://tika.apache.org

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

And for those it doesn't cover, you should be able to write a recogniser.
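For niche formats like the JMP and JSL files from the question, which a generic detector may not know, such a recogniser could be as simple as matching the file extension. A purely illustrative sketch (class name, labels, and URLs are my own):

```java
public class ExtensionRecogniser {
    // Extension-based fallback for formats a generic detector misses,
    // e.g. the JMP/JSL files from the question.
    static String recognise(String url) {
        String path = url.split("[?#]")[0];                      // strip query string and fragment
        String name = path.substring(path.lastIndexOf('/') + 1); // last path segment
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) return "unknown";
        switch (name.substring(dot + 1).toLowerCase()) {
            case "jmp": return "JMP data table";
            case "jsl": return "JSL script";
            default:    return "unknown";
        }
    }
}
```

This is only a heuristic (extensions can lie), so it works best as a fallback after content-type detection.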

tim_yates