Questions tagged [jsoup]

Jsoup is a Java HTML parser for extracting and manipulating HTML data, using the best of DOM, CSS, and jQuery-like methods.

Jsoup is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, Jsoup will create a sensible parse tree.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Selecting specific content

The select(...) method is used to select a subset of the elements in a Document. It accepts a CSS selector that specifies which elements are selected and returned.

Some examples of usage, after loading or parsing an HTML document:

  • Elements links = doc.select("a[href]")

    This will select any a element with an href attribute, i.e. any link on the page.

  • Elements pngs = doc.select("img[src$=.png]")

    This will select any img element whose src attribute value ends in .png, i.e. any PNG image.

This method returns an Elements list which contains all the elements matched by the selector.

There is an introduction on the Jsoup website, and the Javadoc page lists many more advanced possibilities, such as matching by regex, exclusions, pseudo-selectors, etc.
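
As a rough illustration of the two selectors above, here is a minimal sketch that parses an in-memory string instead of fetching a page. The sample markup and the helper names linkTargets and pngCount are made up for this example; only Jsoup.parse, select, and attr are Jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectExample {
    // Hypothetical sample markup, invented for this sketch.
    static final String HTML =
        "<html><body>"
        + "<a href=\"/home\">Home</a>"
        + "<a name=\"anchor-only\">No href</a>"
        + "<img src=\"logo.png\">"
        + "<img src=\"photo.jpg\">"
        + "</body></html>";

    /** Returns the href value of every a element that has an href attribute. */
    static List<String> linkTargets(Document doc) {
        List<String> targets = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            targets.add(link.attr("href"));
        }
        return targets;
    }

    /** Counts img elements whose src attribute ends in .png. */
    static int pngCount(Document doc) {
        return doc.select("img[src$=.png]").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(linkTargets(doc)); // only the a element with an href matches
        System.out.println(pngCount(doc));    // only logo.png matches
    }
}
```

The second a element has no href attribute, so a[href] skips it, and photo.jpg does not match the $=.png suffix test.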

JavaScript support

Jsoup does not currently support JavaScript: it parses the HTML the server returns and does not execute scripts, so data that a page loads with JavaScript will not be available when parsing with Jsoup.

If you want to get such dynamically loaded data, you can:

  • Use an alternative, such as HtmlUnit, Selenium WebDriver or ui4j.

  • Use the website's API, if it offers one.

  • Find out where the website loads its data from. Usually all you need to do is send an HTTP request to that address to get the data as JSON.
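
The last option could look roughly like the sketch below, which uses the JDK's built-in java.net.http client (Java 11 or later) rather than Jsoup. The endpoint URL and the class and method names are placeholders for this example; the real address has to be found by inspecting the page's network traffic, for instance in the browser's developer tools.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class JsonEndpointExample {
    /** Builds a GET request that asks the given endpoint for JSON. */
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
            .header("Accept", "application/json")
            .timeout(Duration.ofSeconds(10))
            .GET()
            .build();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; replace it with the URL the page actually calls.
        String endpoint = args.length > 0 ? args[0] : "https://example.com/api/items.json";

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
            client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());

        System.out.println(response.statusCode());
        System.out.println(response.body()); // raw JSON text, to be handed to a JSON parser
    }
}
```

Jsoup can also fetch non-HTML responses if ignoreContentType(true) is set on the connection, but the returned body is JSON text, not a document to query with selectors.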

Open source

Jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.

Jsoup implements the WHATWG (Web Hypertext Application Technology Working Group) HTML5 specification and parses HTML to the same DOM as modern browsers do.

Jsoup can be used to:

  • Scrape and parse HTML from a URL, file, or string.
  • Find and extract data, using DOM traversal or CSS selectors.
  • Manipulate the HTML elements, attributes, and text.
  • Clean user-submitted content against a safelist (called a whitelist in older Jsoup releases), to prevent XSS attacks.
  • Output tidy HTML.
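
As a sketch of the cleaning use case from the list above, the example below runs untrusted markup through Jsoup.clean. It assumes Jsoup 1.14.2 or later, where the class is named Safelist; older releases use Whitelist instead. The class name CleanExample, the helper sanitize, and the input string are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {
    /** Keeps only the simple text formatting tags permitted by Safelist.basic(). */
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user input containing a script element.
        String dirty = "<p>Hello <b>world</b><script>alert('xss')</script></p>";
        System.out.println(sanitize(dirty)); // the script element is not part of the output
    }
}
```

Safelist.basic() is only one of the predefined sets; which tags and attributes to allow depends on where the cleaned HTML will be displayed.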


Official Website: http://jsoup.org/


6785 questions
1
vote
2 answers

What are some good java libraries to search and scrape data out of a web page.

What are some good open source Java libraries to search and scrape data out of a web page and stick it into a database? For example, suppose I had a page such as: Address: 123 My Street …
JStark
  • 2,788
  • 2
  • 29
  • 37
1
vote
1 answer

Get thrown out of my for loop when gathering list entries from website using JSoup

My objective is to extract a list of ingredients from a recipe page using JSoup. I managed to get my first list entry from the website fine, however my for loop seems to stop at the first entry without gathering the next 5. I'm not sure what I'm…
Zeroid
  • 13
  • 2
1
vote
2 answers

Get HTML nodes that have the same parent - JAVA

I have a document containing several forms similar to the example posted below. I want to extract all the name/value pairs from the hidden input fields of one of the forms, the form is identified by its name and I don't know in advance how many…
Holm
  • 982
  • 2
  • 12
  • 20
1
vote
1 answer

Prevent Jsoup Element object tags lowercase

I am Korean. I don't speak English very well. I like Jsoup. I need to parse XML (SOAP) in my project. I understand Parser.xmlParser() of the Document object. But when I use an Element object, the tag name changes to lowercase. ex) Element element = new Element("TEST")…
pminmin
  • 11
  • 2
1
vote
1 answer

How to get actual source code without compromising case and line break?

I am using jsoup to get source code. I am using jsoup version 1.13.1. When I get the source code using the code below, I found that the case is converted to lowercase. Document doc = Jsoup.connect("https://example.com").get(); webview.loadData(doc); I…
1
vote
1 answer

How to prevent Jsoup from unescaping html?

I am parsing an html string with Jsoup in order to extract just the text, and want to get the exact text, but when I parse strings that include escaped chars Jsoup unescapes them. For example - if I parse

Let's try

Jsoup returns

Let's…

Adi Gutner
  • 31
  • 3
1
vote
0 answers

Highlight a text in Html using Java

I am trying to add highlight to search words in an HTML String in Java. I am parsing the HTML using JSoup and iterating over each text node to find the search words, then adding mark tag around the word. As I am parsing each node sequentially, there…
ProGamer
  • 420
  • 5
  • 16
1
vote
0 answers

trying to webscrape poocoin.app with jsoup

I'm currently making a Discord bot that checks shitcoin prices. When trying to use Jsoup to scrape the page, I'm getting the Cloudflare DDoS-protection message instead of the content of the site. Can I fix this? public class Scraping…
mobertooo
  • 11
  • 1
1
vote
2 answers

What's the difference between Retrofit call and Jsoup connection? How is it possible they return different responses with same URL?

I'm working on refactoring my university's mobile applications API. The main idea was to move it from Jsoup to Retrofit because of the better classes structure (and because Google recommends so). I found that previous version was built around this…
vo1d
  • 11
  • 3
1
vote
0 answers

Get elements that are not a certain type using JSoup

My DOM: I want to get the "but this is the only text i want to select" without also selecting "some text". After looking through other…
Delected
  • 41
  • 4
1
vote
0 answers

How can I get div to text using jsoup?

I'm trying to get number of players from this website to string but it's not working. Here is my code: public class Main { public static void main(String[] args) throws IOException { String url =…
Raptor
  • 11
  • 1
1
vote
3 answers

Libgdx: How to show HTML text in a label?

I have a string like this: "noun
an expression of greeting
- every morning they exchanged polite hellos
•• Syn: hullo, hi, howdy, how-do-you-do" want to show it in a label as a rich text. for…
Hadi Ahmadi
  • 1,924
  • 2
  • 17
  • 38
1
vote
2 answers

Input values to textfields, stream back to website

I connected to a website, used JSoup to find the "textfield" IDs, input the values, now I need to stream it out. Can someone please help me with the correct coding to stream the "modified" doc back to the website? if (source == enter2) { …
Foxticity
  • 11
  • 2
1
vote
2 answers

Help with Jsoup usage in Android

I am trying to connect to a website and pull off some specific information. I was using HTMLCleaner and XPath but it doesn't seem to support all the XPath queries I need. I am trying to use Jsoup now, after reading the good reviews. But the problem…
user841811
  • 11
  • 1
  • 2
1
vote
2 answers

How to filter out certain links using Jsoup

I am creating a simple program that takes a search input from the user and the number of links that they want to receive. However, my code has 2 problems. When I print out the links, it also includes links for images, news, etc... I was wondering…