Questions tagged [jsoup]

Jsoup is a Java HTML parser for extracting and manipulating HTML data, using the best of DOM, CSS, and jQuery-like methods.

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, Jsoup will create a sensible parse tree.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Selecting specific content

The select(...) method selects a subset of the Elements in a Document. It accepts a CSS selector that specifies which elements are matched and returned.

Some examples of usage, after loading or parsing an HTML document:

  • Elements links = doc.select("a[href]")

    This will select any a element with an href attribute, i.e. any link on the page.

  • Elements pngs = doc.select("img[src$=.png]")

    This will select any img element where the value of the src attribute ends in .png, i.e. any image whose source is a PNG file.

This method returns an Elements list which contains all the elements matched by the selector.
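The two selectors above can be tried without any network access by parsing a string. This is a minimal sketch: the class name SelectExample and the sample markup are made up for illustration, and only Jsoup.parse, Document.select, Element.attr and Element.text are used from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectExample {
    // Hypothetical sample markup, not taken from any real page.
    static final String HTML =
        "<html><body>"
        + "<a href=\"/about\">About</a>"
        + "<a name=\"top\">Anchor without href</a>"
        + "<img src=\"logo.png\">"
        + "<img src=\"photo.jpg\">"
        + "</body></html>";

    // Parse the sample markup and return every element matching the CSS query.
    static Elements select(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery);
    }

    public static void main(String[] args) {
        // Only the first <a> has an href attribute.
        for (Element link : select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
        // Only logo.png has a src value ending in .png.
        for (Element png : select("img[src$=.png]")) {
            System.out.println(png.attr("src"));
        }
    }
}
```

Because select(...) returns an empty Elements list rather than null when nothing matches, the loops are safe even for selectors that find nothing.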

There is an introduction on the Jsoup website, and the Javadoc page lists many more advanced possibilities, such as matching by regex, exclusions, pseudo-selectors, etc.

JavaScript support

Jsoup does not currently execute JavaScript, which means that data a page loads with JavaScript will not be present in the document Jsoup parses.

If you want to get such dynamically loaded data, you can:

  • Use an alternative, such as HtmlUnit, Selenium WebDriver or ui4j.

  • Use the website's API, if it offers one.

  • Find out where the website loads its data from. Usually all you need to do is send an HTTP request to that address to get the data as JSON.
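For the last option, Jsoup's own Connection can send the request, as long as it is told not to reject a non-HTML content type. The sketch below is illustrative only: the class name JsonEndpointExample and the endpoint URL are hypothetical, and the real address would be whatever the browser's network tools show for the page in question.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class JsonEndpointExample {
    // Hypothetical endpoint; replace with the address found in the browser's network tab.
    static final String ENDPOINT = "https://example.com/api/items.json";

    // Prepare a GET request that accepts a JSON response. Nothing is sent yet.
    static Connection jsonRequest(String url) {
        return Jsoup.connect(url)
                .ignoreContentType(true) // by default Jsoup rejects non-HTML/XML responses
                .header("Accept", "application/json")
                .method(Connection.Method.GET);
    }

    // Send the request and return the raw response body (the JSON text).
    static String fetchJson(String url) throws IOException {
        return jsonRequest(url).execute().body();
    }

    public static void main(String[] args) {
        // Only prints the prepared request; call fetchJson(...) to contact the server.
        Connection conn = jsonRequest(ENDPOINT);
        System.out.println(conn.request().method() + " " + conn.request().url());
    }
}
```

The returned string still has to be parsed with a JSON library of your choice; Jsoup itself only handles HTML and XML.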

Open source

Jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.

Jsoup implements the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification and parses HTML to the same DOM as modern browsers do.
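A quick way to see this browser-like parsing is to feed Jsoup markup with missing and unclosed tags. The following is a small sketch with a made-up class name (TagSoupExample) and made-up input; it relies only on Jsoup.parse, Document.title and Document.select.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TagSoupExample {
    // Hypothetical malformed input: no <html>, <head> or <body>, and unclosed <p> tags.
    static final String SOUP = "<title>Demo</title><p>One<p>Two";

    // Parse the input the way a browser would, creating the missing structure.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    // Count the <p> elements in the resulting parse tree.
    static int countParagraphs(String html) {
        return parse(html).select("p").size();
    }

    public static void main(String[] args) {
        Document doc = parse(SOUP);
        System.out.println("title: " + doc.title());
        System.out.println("paragraphs: " + countParagraphs(SOUP));
        // Prints the normalized document, including the generated html, head and body.
        System.out.println(doc.outerHtml());
    }
}
```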

Jsoup can be used to:

  • Scrape and parse HTML from a URL, file, or string.
  • Find and extract data, using DOM traversal or CSS selectors.
  • Manipulate the HTML elements, attributes, and text.
  • Clean user-submitted content against a safe white-list, to prevent XSS attacks.
  • Output tidy HTML.
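Two of the uses listed above, cleaning untrusted input and manipulating elements, might look like the sketch below. It assumes Jsoup 1.14.2 or later, where the list of allowed tags is the Safelist class (earlier releases named it Whitelist). The class name CleanAndEditExample, the method names and the sample input are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class CleanAndEditExample {
    // Hypothetical user-submitted markup containing a script and an inline event handler.
    static final String UNTRUSTED =
        "<p>Hi <script>alert(1)</script>"
        + "<a href=\"https://example.com/\" onclick=\"steal()\">link</a></p>";

    // Keep only the basic text-formatting tags and attributes allowed by Safelist.basic().
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Add a target attribute and a CSS class to every link in an HTML fragment.
    static String markLinksExternal(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.select("a[href]").attr("target", "_blank").addClass("external");
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(sanitize(UNTRUSTED));
        System.out.println(markLinksExternal("<a href=\"/x\">x</a>"));
    }
}
```

Safelist.basic() is only one of the predefined lists; which tags and attributes to allow depends on where the cleaned HTML will be displayed.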


Official Website: http://jsoup.org/


6785 questions
1
vote
1 answer

Jsoup getElementById throws a NullPointerException only on GCloud VM

The jar runs without errors on my local machine, but when I load it on a GCloud VM, Jsoup getElementById throws an NPE. Document doc = Jsoup.connect(url).get(); works properly on both machines and I can print the doc, but Element section =…
Alberto
  • 11
  • 2
1
vote
1 answer

SocketTimeoutException: Read timed out

I have a web application that reads HTML pages using the following command: Document doc = Jsoup.connect(url).post(); My web application then redisplays the HTML page with some modifications. It works fine and it reads any HTML page that I…
Bachayer
  • 37
  • 2
  • 7
1
vote
1 answer

How to get text from this html page with jsoup?

I am using this code to retrieve the text in the main article on this page. public class HtmlparserExampleActivity extends Activity { String outputtext; TagFindingVisitor visitor; Parser parser = null; private static final String TAG =…
android_king22
  • 759
  • 2
  • 20
  • 36
1
vote
2 answers

Add https to missing strings of an array?

I'm writing an app for a client who doesn't have an official API but wants the app to extract video links from his website, so I wrote the logic using jsoup. Everything seems to work fine, except some of the links don't start with https, so I'm trying to…
Meggan Sam
  • 319
  • 1
  • 4
  • 17
1
vote
2 answers

Help scraping HTML with JSoup

Little bit of a beginner here, working on a personal project to scrape my school's course offerings into an easy-to-read tabular format, but I am having trouble with the initial step of scraping the data from the site. I just added the JSoup library to…
asolanki
  • 37
  • 1
  • 7
1
vote
1 answer

Extracting an element from jsoup for a text value match in the element attribute

How do I get the span with a certain text within an attribute? I am trying to extract the number that comes after the text "stars". So how can I select a span tag that has text "rating_sprite stars" and I want the value "star5" to be extracted from…
serah
  • 2,057
  • 7
  • 36
  • 56
1
vote
1 answer

How to parse unstructured data (i.e. from an HTML directory listing) using JSOUP?

As an example https://download.bls.gov/pub/time.series/ shows date/ timestamp / filesize information that doesn't appear to be enclosed by HTML tags. If we'd like to consider the date and timestamp information related to each link, what are ideal…
discord
  • 59
  • 10
1
vote
0 answers

Android - Parsing website from webview with jsoup

private final Handler uiHandler = new Handler(); private class JSHtmlInterface { @android.webkit.JavascriptInterface public void showHTML(String html) { final String htmlContent = html; uiHandler.post( new…
leonscot
  • 11
  • 2
1
vote
1 answer

How to convert xhtml to html in java?

I converted an html string to xhtml in java using jsoup as shown here: Is it possible to convert HTML into XHTML with Jsoup 1.8.1? However, I couldn't find a way to do the opposite, I mean, convert xhtml to html; is there a way to do this in java?
Louabk
  • 11
  • 1
1
vote
0 answers

JSOUP Unable to parse html string to document if a node value contains like

Trying to convert the HTML String to a document through JSoup fails with invalid XML characters. This error can happen when a user copies an email address from Outlook. It looks like JSoup could fail if a text with special characters…
Shaan
  • 588
  • 1
  • 4
  • 15
1
vote
2 answers

Why aren't UTF-8 characters being rendered correctly in this web page (generated with JSoup)?

I'm having trouble dealing with Charsets while parsing and rendering a page using the JSoup library. Here is an example of the page it renders: http://dl.dropbox.com/u/13093/charset-problem.html As you can see, where there should be ' characters,…
sanity
  • 35,347
  • 40
  • 135
  • 226
1
vote
1 answer

Query train arrivals/departures from https://enquiry.indianrail.gov.in/mntes

I am trying to get a list of stops (stations) for a train by passing the train number and other required parameters (taken from the Firefox web developer tools) with the URL (POST method), but I get a 404 page not found error code. When I tried with POSTMAN, it gets…
IronFist
  • 47
  • 4
1
vote
1 answer

How to resolve dependency version conflicts (NoSuchMethodError's)

In my Spring 3.0.5 Web MVC application I have defined a model class with a property annotated with @SafeHtml. When Spring tries to validate this model object, it blows up with the following: HTTP ERROR:…
Peter Perháč
  • 20,434
  • 21
  • 120
  • 152
1
vote
1 answer

400 Http Errors Using Jsoup in Multithreaded Program

I've created a program that parses html pages. I use jsoup connect function within a callable class inside ThreadPool. The problem is that I'm connecting to the same website and with a thread pool size of 5+, I get IO Exceptions - 400 errors. How do…
samwise
  • 269
  • 2
  • 5
  • 13
1
vote
2 answers

Jsoup eats extra information of DocType if it includes a linebreak

When I println a file downloaded using Jsoup, some information from the DocType is missing if there is a linebreak in it. Is this intended or is this a bug? For example: The DocType looks like this:
Bene
  • 41
  • 2