Questions tagged [jsoup]

Jsoup is a Java HTML parser for extracting and manipulating HTML data, using the best of DOM, CSS, and jQuery-like methods.

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, Jsoup will create a sensible parse tree.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Selecting specific content

The select(...) method selects a subset of the Elements from a Document. It accepts a CSS selector that specifies which elements are selected and returned.

Some examples of usage, after loading or parsing an HTML document:

  • Elements links = doc.select("a[href]")

    This will select any a with a href attribute, i.e. any link on the page.

  • Elements pngs = doc.select("img[src$=.png]")

    This will select any img element where the value of the src attribute ends in .png, so this will select any image which is a PNG image.

This method returns an Elements list which contains all the elements matched by the selector.

There is an introduction on the Jsoup website, and the Javadoc page lists many more advanced possibilities, such as matching by regex, exclusions, pseudo-selectors, etc.
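The two selectors above can be tried without any network access by parsing an inline string. The following is a minimal sketch, assuming jsoup is on the classpath; the class name SelectSketch, the sample markup, and the base URI https://example.com/ are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectSketch {
    // Hypothetical sample markup, so the example does not depend on a live page.
    static final String HTML =
            "<html><body>"
          + "<a href='/about'>About</a>"
          + "<a>Anchor without href</a>"
          + "<img src='logo.png'>"
          + "<img src='photo.jpg'>"
          + "</body></html>";

    // Collect the absolute URL of every a element that has an href attribute.
    static List<String> hrefs(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            out.add(link.absUrl("href")); // resolved against the document's base URI
        }
        return out;
    }

    // Collect the src value of every img element whose src ends in .png.
    static List<String> pngSources(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element img : doc.select("img[src$=.png]")) {
            out.add(img.attr("src"));
        }
        return out;
    }

    public static void main(String[] args) {
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(HTML, "https://example.com/");
        System.out.println(hrefs(doc));
        System.out.println(pngSources(doc));
    }
}
```

Passing a base URI to Jsoup.parse is what allows absUrl to turn the relative /about link into a full URL; without it, absUrl returns an empty string for relative links.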

JavaScript support

Jsoup does not currently support JavaScript, which means that data a page loads with JavaScript will not be available when parsing with Jsoup.

If you want to get such dynamically loaded data, you can:

  • Use an alternative, such as HtmlUnit, Selenium WebDriver or ui4j.

  • Use the website's API, if it offers one.

  • Find out where the website loads its data from. Usually all you need to do is send an HTTP request to that location to get the data as JSON.
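The last option can be sketched with the HTTP client built into Java 11 and later. This is only an illustration under stated assumptions: the endpoint URL is hypothetical (the real one is usually found in the browser's network tab), and the class and method names are invented for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class JsonEndpointSketch {
    // Hypothetical endpoint; replace with the URL the page itself requests.
    static final String ENDPOINT = "https://example.com/api/items?page=1";

    // Build a GET request that asks the server for JSON.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Send the request and return the raw JSON body as a string.
    static String fetchJson(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass a real endpoint as the first argument to actually fetch data.
        String url = args.length > 0 ? args[0] : ENDPOINT;
        System.out.println(fetchJson(url));
    }
}
```

The returned string still has to be parsed with a JSON library of your choice; Jsoup is not needed for that step.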

Open source

Jsoup is an open source project distributed under the liberal MIT license. The source code is available at GitHub.

Jsoup implements the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification and parses HTML to the same DOM as modern browsers do.

Jsoup can be used to ...

  • Scrape and parse HTML from a URL, file, or string.
  • Find and extract data, using DOM traversal or CSS selectors.
  • Manipulate the HTML elements, attributes, and text.
  • Clean user-submitted content against a safe white-list, to prevent XSS attacks.
  • Output tidy HTML.
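
As an illustration of the cleaning use case in the list above, here is a minimal sketch. It assumes a recent jsoup release in which the allow-list class is named Safelist (older releases call it Whitelist); the class name CleanSketch and the sample input are made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanSketch {
    // Keep only the tags and attributes permitted by the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user-submitted content with a script and an event handler.
        String dirty = "<p>Hello <b>world</b>"
                + "<a href='http://example.com/' onclick='steal()'>link</a>"
                + "<script>alert('x')</script></p>";
        // The script element and the onclick attribute are dropped.
        System.out.println(sanitize(dirty));
    }
}
```

Safelist.basic() is one of several presets; a stricter or looser preset may suit a given application better.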

Official Website: http://jsoup.org/

6785 questions

1 vote, 1 answer

Jsoup prevent pretty print

If I do something like this with a jsoup Element new Element("div").html("

") The output of the toString() method (or for that case the outerHtml()) method yields something like this.

I cannot find anything in the…
Tom

1 vote, 1 answer

[JSOUP] Why does 1.6.x remove TD tags? Problems upgrading to 1.6.x

System.out.println(Jsoup.parseBodyFragment("123").html()); jsoup 1.5.2 OUTPUT:
123
jsoup 1.6.x (1.6.0 and 1.6.1)…
Koerr

1 vote, 2 answers

How to get a specific tag from a div class with this html page?

I am trying to retrieve the image URL from this HTML page. The image is inside the editions box on the webpage. How would I go about getting it using the Jsoup selector method? Such as Document doc = Jsoup.connect(url).get(); Element png =…
yoshi24

1 vote, 1 answer

Java scraping website after async scripts are loaded

A little background: I'm trying to give customers an option to add HTML directly and publish a single-page website (like Blogspot). This brought a scammer problem, so I created a microservice that blocks publishing a website based on HTML…
Ijaz

1 vote, 1 answer

How to avoid Force close with IOException and SocketTimeoutException JSoup?

I am using this code to retrieve an HTML page and parse it: while(doc == null && retry<5){ retry++; try { doc = Jsoup.connect(url).get(); } catch (IOException e) { …
yoshi24

1 vote, 0 answers

Download all PDF links from html webpage from java program using jSoup

I want to download all PDFs from the links in this webpage from my java program - https://www.bseindia.com/corporates/ann.html I used jSoup to connect to the website and extract all links but it is not showing the PDF file links. When I open the…
Sprint T

1 vote, 1 answer

NullPointerException in DOM selector method

I keep getting this error NullPointer 08-16 22:55:46.360: ERROR/AndroidRuntime(11047): Caused by: java.lang.NullPointerException 08-16 22:55:46.360: ERROR/AndroidRuntime(11047): at…
yoshi24

1 vote, 2 answers

How to set a default selector for these articles from the same site?

I am trying to retrieve the whole overview section for this URL. What elements would I look for in the three different articles? http://xbox360.gamespy.com/xbox-360/project-dark/ Is there any way to create a default selector to retrieve the…
coder_For_Life22

1 vote, 2 answers

How to load these elements to populate a list in android?

I am using this to get a list of elements from a webpage. http://www.gamespy.com/index/release.html // Get all the game elements Elements games = doc.select("div.FP_Up_TextWrap b"); // Create new ArrayList ArrayList gameList = new…
coder_For_Life22

1 vote, 1 answer

Jsoup hyperlink scraping not working for some websites

I've been working on a project recently which involves scraping specific products from websites and reporting the availability status (graphics cards, if anyone is curious). Using JSOUP, I've been doing this by going through product listing pages,…
Yashpashar

1 vote, 2 answers

How to retrieve list on a webpage and return it to a list in android?

I am trying to retrieve this: http://www.vgreleases.com/ReleaseDates/Upcoming.aspx It is a list of upcoming titles. I just want to retrieve the list. How would I go about doing this? I am familiar with JSOUP. Just need a little help with this one.…
coder_For_Life22

1 vote, 2 answers

No number showing when trying to scrape a number with Jsoup

I tried to scrape the current world population from this website, but there isn't any number in the span there. I'm kind of new to programming, so it would be helpful if someone could answer my question. This is my code: package org.jsoup.examples; import…

1 vote, 1 answer

(AsyncTask) Open a dialog when catch (Jsoup)

I want the dialog to open when "AsyncTask" is "catch". I tried to call Dialogue into a “catch”. But the program is crashing. How do I open a dialog when there is a catch? My code: public class test extends AsyncTask { …
NONAME

1 vote, 1 answer

Unable to extract data from the internet with any technology - java.net.SocketException: java.security.NoSuchAlgorithmException

For some time now, I have been trying to extract data from the internet using technologies such as: HtmlUnit, Java Selenium, JSoup, Apache HttpClient, and Java 11 java.net.http. Problem: I never had a problem doing data extraction before. Everything always…
Loa

1 vote, 1 answer

Jsoup stops parsing a webpage

Jsoup.parse(String html) stops working. I have an application where I use jsoup a few times to parse different pages, but when I want to parse a big page, jsoup just stops and that is all. Does it have a limit or a maximum size of a…
user849998