I am trying to parse an HTML file with JTidy, but the content of the file is missing from the output, even though the log shows JTidy going through the content of the file. This is my code:
public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
    // Backslashes must be escaped in Java string literals
    File file = new File("C:\\folder\\file.html");
    InputStream in = inputStream(file);
    OutputStream out = null;
    Document doc = cleanData(in, out);
}

public static Document cleanData(InputStream in, OutputStream out) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    tidy.setForceOutput(true);
    tidy.parseDOM(in, out);
    Document dom = tidy.parseDOM(in, out);
    return dom;
}

public static InputStream inputStream(File file) throws FileNotFoundException {
    FileInputStream fis = new FileInputStream(file);
    return fis;
}
However, it only outputs:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><html xmlns=""><head><meta content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" name="generator"/><title/></head><body/></html>
Does anybody know what I am doing wrong?
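
One likely cause, offered as a guess rather than a confirmed diagnosis: `cleanData` calls `tidy.parseDOM(in, out)` twice on the same `InputStream`. The first call would read the stream to its end and its result is discarded, so the second call has nothing left to parse and would return an empty document, which matches the empty `<title/>` and `<body/>` above. The sketch below uses only JDK classes to show that behaviour; `StreamExhaustionDemo` and `drain` are hypothetical names that stand in for the two `parseDOM` calls and are not part of JTidy.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamExhaustionDemo {

    // Hypothetical helper: reads the stream to its end and returns
    // the number of bytes consumed, much as a parser would.
    static int drain(InputStream in) {
        try {
            int total = 0;
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><p>hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(html);

        // Stands in for the first tidy.parseDOM(in, out): consumes everything.
        int first = drain(in);
        // Stands in for the second call on the same stream: nothing is left.
        int second = drain(in);

        System.out.println("first pass read " + first + " bytes");
        System.out.println("second pass read " + second + " bytes");
    }
}
```

If this is what is happening, removing the bare `tidy.parseDOM(in, out);` line and keeping only `Document dom = tidy.parseDOM(in, out);` should be enough, since the stream is then read exactly once.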