Questions tagged [jsoup]

Jsoup is a Java HTML parser for extracting and manipulating HTML data, using the best of DOM, CSS, and jQuery-like methods.

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, Jsoup will create a sensible parse tree.
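As a rough illustration of the tag-soup handling described above, the sketch below parses a fragment with unclosed tags and reads back the paragraphs. The class name TagSoupDemo and the sample markup are made up for this example; only Jsoup.parse, select, and eachText come from the library.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class TagSoupDemo {

    // Parses a (possibly malformed) HTML string and returns the text of each <p> element.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").eachText();
    }

    public static void main(String[] args) {
        // Hypothetical tag soup: no closing </p> or </b>, no <html> or <body>.
        String tagSoup = "<p>One<p>Two <b>bold";

        // The parser still builds a full tree, with each <p> as its own element.
        System.out.println(paragraphTexts(tagSoup));

        // Print the normalized markup that Jsoup produced for the body.
        System.out.println(Jsoup.parse(tagSoup).body().html());
    }
}
```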

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
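The same selection can be tried without a network connection by parsing a string instead. The sketch below assumes a small, invented piece of markup that mimics the "#mp-itn b a" structure; the class name, the sample HTML, and the base URI are assumptions made for illustration and say nothing about Wikipedia's current page layout.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class HeadlineDemo {

    // Hypothetical markup shaped like the "In the news" box targeted by the selector.
    static final String SAMPLE_HTML =
            "<div id=\"mp-itn\"><ul>"
            + "<li><b><a href=\"/wiki/First\">First headline</a></b> and more text</li>"
            + "<li><b><a href=\"/wiki/Second\">Second headline</a></b></li>"
            + "<li><a href=\"/wiki/Other\">Not bold, so not selected</a></li>"
            + "</ul></div>";

    // Returns the link text of every headline matched by the selector.
    static List<String> headlineTitles(String html) {
        Document doc = Jsoup.parse(html);
        Elements newsHeadlines = doc.select("#mp-itn b a");
        return newsHeadlines.eachText();
    }

    // Returns absolute headline URLs; the base URI is used to resolve relative hrefs.
    static List<String> headlineUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("#mp-itn b a").eachAttr("abs:href");
    }

    public static void main(String[] args) {
        System.out.println(headlineTitles(SAMPLE_HTML));
        System.out.println(headlineUrls(SAMPLE_HTML, "https://en.wikipedia.org/"));
    }
}
```

Passing a base URI to Jsoup.parse is what makes the abs:href lookup resolve relative links; when a page is fetched with Jsoup.connect(...).get(), the base URI is taken from the request URL.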

Selecting specific content

The select(...) method selects a subset of the elements in a Document. It accepts a CSS selector that specifies which elements are matched and returned.

Some examples of usage, after loading or parsing an HTML document:

  • Elements links = doc.select("a[href]")

    This selects every a element that has an href attribute, i.e. every link on the page.

  • Elements pngs = doc.select("img[src$=.png]")

    This selects every img element whose src attribute value ends in .png, i.e. every image whose source is a .png file.

This method returns an Elements list which contains all the elements matched by the selector.
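A minimal sketch of both selectors in use follows, run against an invented HTML snippet so that it works offline. The class name SelectorDemo, its helper methods, and the sample markup are assumptions for this example.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorDemo {

    // Hypothetical page: one real link, one anchor without href, two PNGs and one JPEG.
    static final String SAMPLE_HTML =
            "<p><a href=\"/docs\">Docs</a> <a name=\"top\">Anchor only</a></p>"
            + "<img src=\"logo.png\"><img src=\"photo.jpg\"><img src=\"chart.png\">";

    // href values of all a elements that carry an href attribute.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href]").eachAttr("href");
    }

    // src values of all img elements whose src ends in .png.
    static List<String> pngSources(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("img[src$=.png]").eachAttr("src");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);

        // Elements is a list, so it can be iterated with a for-each loop.
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }

        System.out.println(pngSources(SAMPLE_HTML));
    }
}
```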

There is an introduction on the Jsoup website, and the Javadoc page lists many more advanced possibilities, such as matching by regex, exclusions, pseudo-selectors, etc.

JavaScript support

Jsoup does not currently support JavaScript: it parses the HTML as delivered and does not run scripts. Data that a page loads with JavaScript will therefore not be available when parsing with Jsoup.

If you want to get such dynamically loaded data, you can:

  • Use an alternative, such as HtmlUnit, Selenium WebDriver or ui4j.

  • Use the website's API, if it offers one.

  • Find out where the website loads its data from; usually all you then need to do is send an HTTP request to that endpoint to get the data as JSON.
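For the last option, Jsoup's own HTTP client can be used to request the JSON directly. The sketch below is one possible approach: the endpoint URL, class name, and header values are placeholders, and a real site may need different headers, cookies, or parameters. By default Jsoup refuses non-HTML content types, so ignoreContentType(true) is set before the raw body is read.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

class JsonEndpointDemo {

    // Hypothetical endpoint; replace it with the URL the page actually loads its data from.
    static final String ENDPOINT = "https://example.com/api/items?page=1";

    // Builds the request without sending it.
    static Connection jsonRequest(String url) {
        return Jsoup.connect(url)
                .ignoreContentType(true)              // accept application/json responses
                .header("Accept", "application/json")
                .userAgent("Mozilla/5.0")
                .timeout(10_000);                     // milliseconds
    }

    // Sends the request and returns the response body as an unparsed string.
    static String fetchJson(String url) throws IOException {
        return jsonRequest(url).execute().body();
    }

    public static void main(String[] args) throws IOException {
        // Performs a real HTTP request; with the placeholder URL it is expected to fail.
        System.out.println(fetchJson(ENDPOINT));
    }
}
```

The returned string still has to be parsed with a JSON library of your choice; Jsoup itself only handles HTML and XML.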

Open source

Jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.

Jsoup implements the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification and parses HTML to the same DOM as modern browsers do.

Jsoup can be used to ...

  • Scrape and parse HTML from a URL, file, or string.
  • Find and extract data, using DOM traversal or CSS selectors.
  • Manipulate the HTML elements, attributes, and text.
  • Clean user-submitted content against a safe white-list, to prevent XSS attacks.
  • Output tidy HTML.
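The cleaning use case can be sketched as follows. This assumes jsoup 1.14.2 or later, where the allow-list class is named Safelist (older releases call it Whitelist); the class name CleanDemo and the sample input are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanDemo {

    // Keeps only the simple text-formatting tags permitted by Safelist.basic()
    // and drops everything else, including script elements and event-handler attributes.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user-submitted content containing an XSS attempt.
        String submitted =
                "<p>Hello <b onclick=\"steal()\">world</b><script>alert(1)</script></p>";
        System.out.println(sanitize(submitted));
    }
}
```

Safelist.basic() is only one of the predefined lists; which tags and attributes to allow depends on the application.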


Official Website: http://jsoup.org/


6785 questions
33 votes, 2 answers

Jsoup select div having multiple classes

I am trying to select, using Jsoup, a <div> that has multiple classes: ... The syntax for doing so, to the best of my understanding, should…
ef2011
32 votes, 3 answers

Jsoup select and iterate all elements

I connect to a URL through Jsoup and get all of its contents, but if I select with doc.select("body"), it returns a single element. I want to get all the elements in the page and iterate over them one by one for…
Karthik
30 votes, 3 answers

Jsoup Cookies for HTTPS scraping

I am experimenting with this site to gather my username on the welcome page, in order to learn Jsoup and Android. Using the following code: Connection.Response res = Jsoup.connect("http://www.mikeportnoy.com/forum/login.aspx") …
Brian
29 votes, 7 answers

Jsoup.clean without adding html entities

I'm cleaning some text from unwanted HTML tags (such as