Questions tagged [jtidy]

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used to clean up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which lets you use JTidy as a DOM parser for real-world HTML.
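
As a rough illustration of the DOM-parser use case, the sketch below queries a DOM document with the JDK's standard XPath API. To stay runnable without the JTidy jar, it builds the `Document` from already well-formed markup with the JDK's own parser; that stand-in is an assumption of this example. With JTidy, the `Document` would instead come from `Tidy.parseDOM`. The class and method names (`XPathOnDom`, `parse`, `firstHeading`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathOnDom {

    // Stand-in for the JTidy step (assumption): parses markup that is
    // already well-formed, using the JDK's built-in DOM parser.
    public static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
        return builder.parse(new ByteArrayInputStream(bytes));
    }

    // Returns the text content of the first <h1> element,
    // or an empty string if the document has none.
    public static String firstHeading(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//h1", doc);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Hello</h1><p>text</p></body></html>";
        System.out.println(firstHeading(parse(xhtml)));
    }
}
```

Keeping the parsing step in its own method means only `parse` would need to change if the document were produced by JTidy instead.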

JTidy was written by Andy Quick, who later stepped down as maintainer. JTidy is now maintained by a group of volunteers.

Official Website: http://jtidy.sourceforge.net/

97 questions
1
vote
2 answers

Getting Exception on evaluating an XPath expression in Java

I am trying to learn how to use XPath expressions with Java. I am using JTidy to convert the HTML page to XHTML so that I can easily parse it using XPath expressions. I have the following code: DocumentBuilderFactory factory =…
A Beginner
  • 393
  • 2
  • 12
1
vote
1 answer

JTidy can't handle HTML tags inside script element

(This is a followup to a problem I had a few days ago, where JTidy was reporting 3 errors inside a 300k HTML document, but not reporting where. After some grinding on the problem, I found what appears to be causing the error, and I have a strong…
Paul Brinkley
  • 6,283
  • 3
  • 24
  • 33
1
vote
1 answer

Sitemap using sax and webcrawler

Hi everyone, this is my first question here and I'm not a programmer. I would like to generate a sitemap. I am crawling a website with webcrawler (crawler.dev.java.net). Is there any way to use a SAX parser for the data I get? I also used JTidy and I…
user452065
  • 17
  • 3
1
vote
0 answers

Stop Jtidy parsing if element is found

Is there any way to only download and parse an XML document until an element is found using an XPathExpression? I'm using Java: url = new URL("http://registroapps.uniandes.edu.co/scripts/adm_con_horario1_joomla.php?depto="+params[0]); Tidy…
Hugo M. Zuleta
  • 572
  • 1
  • 13
  • 27
1
vote
1 answer

Validate HTML code programmatically

I am trying to validate a String of HTML code. That is, when the HTML syntax is wrong I want to know, perhaps in the form of a return false. I am currently using JTidy, but it doesn't tell me there was bad syntax; it just corrects it. I don't need…
Mike John
  • 818
  • 4
  • 11
  • 29
1
vote
0 answers

Jtidy & ITextRenderer are not giving right output

I have the following code to convert the html to pdf and two intermediate files getting created. File file = new File("file.tmp"); String y1 = "…
1
vote
1 answer

How to set Tidy configuration to not replace
tags?

File file = new File("xxxxxxx"); String y1 = "
"; FileWriter fw = new FileWriter(file); fw.write(y1); fw.close(); FileReader r…
vinod kumar
  • 175
  • 2
  • 3
  • 6
1
vote
1 answer

Remove desired tag from html using JTidy

I am using JTidy and XPath to parse HTML, but parsing text is causing me a little trouble because it may include a b tag inside. I don't want to loop over its child nodes, but simply remove 'b' tags after it loads the HTML. How can I…
Suhrob Samiev
  • 1,528
  • 1
  • 25
  • 57
1
vote
0 answers

Escaping converting Danish characters by JTidy

I'm using JTidy to convert an HTML page to XHTML. The HTML contains Danish characters, and JTidy converts them into some specific characters. E.g., the word "Observér" is converted to "Observér". Is there a way to avoid this?
user1909157
  • 21
  • 1
  • 4
1
vote
0 answers

Java screen scraping with JTidy - Parsing HTML values

So what I'm trying to accomplish is scraping an IMDB webpage for data from webseries. Problem is when I convert the page to a DOM object and try to get values it's not as easy as it looks. For instance: I use getElementsByTagName("h1") -> it returns…
Mo Binni
  • 265
  • 1
  • 12
1
vote
1 answer

xpaths not working in java

I am trying to access a URL, get the HTML from it, and use XPath expressions to get certain values from it. I am getting the HTML just fine and JTidy seems to be cleaning it appropriately. However, when I try to get the desired values using XPath, I get an…
Bobby
0
votes
1 answer

how to remove error log in Jtidy?

I use the code below for JTidy: Tidy tidy = new Tidy(); tidy.setQuiet(true); tidy.setShowWarnings(false); doc = tidy.parseDOM(in, null); It removes all warning logs, but I still get the error log below: line 424 column 20 - Error: is not…
terry
  • 301
  • 3
  • 17
0
votes
5 answers

Fastest way to traverse or find elements in DIV HTML

I am writing a utility which should hit the URL of a dynamic page, retrieve the content, search for a specific div tag among various nested div tags, and grab the content. Mainly, I am looking for some Java code/library. JavaScript or some…
Sourabh
  • 1,515
  • 1
  • 14
  • 21
0
votes
1 answer

Comments getting escaped with NekoHTML (or JTidy) + XOM

I'm using NekoHTML to clean up some HTML, and then feeding it to XOM to get an object model. Somewhere in the course of this, comments are getting escaped. Here's a relevant example of the input HTML (most of it cut for clarity):
David Moles
  • 48,006
  • 27
  • 136
  • 235
0
votes
1 answer

Malformed XML/HTML parsing

I need to parse multiple (read: approx. 1600) HTML pages and pull out the contents of the following tag from each file. textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()"…
John McDonnell
  • 753
  • 1
  • 8
  • 24