I am developing a web-crawling application using crawler4j and Jsoup. I need to parse a web page with Jsoup and check whether it links to zip files, documents (pdf/doc), or media files (mp3/mov) that are available for download.

For zip files I did the following, and it works:

Elements zip = doc.select("a[href\$=.zip]")
println "No of zip files is " + zip.size()

This code correctly tells me how many zip files a page links to. I am not sure how to count all the audio files or document files using Jsoup. Any help is appreciated. Thanks.

clever_bassi

1 Answer

Using the same approach, I suspect it would be something like this:

Elements docs = doc.select("a[href\$=.doc]")
println "No of doc files is " + docs.size()

Elements mp3s = doc.select("a[href\$=.mp3]")
println "No of mp3 files is " + mp3s.size()

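Really, it's just an attribute selector: a[href$=.ext] matches every link whose href attribute ends with the given file extension, so the same pattern works for pdf, mov, or any other type.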
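If you want one count per category (say, all documents or all audio/video files) rather than one per extension, Jsoup selectors can be grouped with commas, so several "ends with" selectors can go into a single select call. Here is a minimal sketch of a helper that builds such a grouped selector. The class name DownloadSelectors and the method selectorFor are made up for illustration and are not part of Jsoup. It is plain Java, so the $ needs no escaping here, unlike in a Groovy double-quoted string.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/**
 * Hypothetical helper (not part of Jsoup): builds a comma-separated
 * CSS selector group that matches links ending in any of the given
 * file extensions.
 */
final class DownloadSelectors {

    private DownloadSelectors() {}

    // For ("mp3", "mov") this returns "a[href$=.mp3], a[href$=.mov]".
    static String selectorFor(String... extensions) {
        return Arrays.stream(extensions)
                .map(ext -> "a[href$=." + ext + "]")
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // Print the grouped selectors for the two categories in the question.
        System.out.println(selectorFor("pdf", "doc")); // a[href$=.pdf], a[href$=.doc]
        System.out.println(selectorFor("mp3", "mov")); // a[href$=.mp3], a[href$=.mov]
    }
}
```

Assuming doc is the parsed Jsoup Document from the question, something like doc.select(DownloadSelectors.selectorFor("mp3", "mov")).size() should then give the combined count. One caveat: an "ends with" match will miss links that carry a query string after the extension (for example file.pdf?dl=1), so treat the count as a lower bound for such pages.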

Joshua Moore