HTML parser without tidying the source

Question

I have several hundred old HTML files on my machine that I am trying to parse to extract some data. I have tried several Java parsers, including Jsoup, TagSoup, HtmlCleaner, and JTidy. Because of the way the HTML is structured in these files, I can only use parsers that support XPath; I tried Jsoup but couldn't find an equivalent CSS selector.

Anyway, my problem is that every parser I try cleans up the actual content and converts characters such as ' (apostrophe) into weird characters.

Is it possible to parse the content with any Java parser without tidying it and replacing the special characters?

As long as the HTML files are well formed, you can use any DOM parsing technique. The advantage of these parsers is that they also handle ill-formed HTML, but I would guess they have options to at least not tamper with the content. — Vikdor, Aug 23 '12 at 10:05
But these parsers are changing the content and replacing special characters with weird characters. For example, for the text "Abc’d" the output comes out as "Abcâ??d". I want to keep the "'" as it is if possible, or at least convert it to its proper ASCII code. — PTS Admin, Aug 23 '12 at 10:15
That seems to be a problem with text encoding. You might want to stick to one parser that you are comfortable with and just address the encoding issue. — Vikdor, Aug 23 '12 at 10:16
I found a solution for now. I am parsing half the content using JTidy (XPath) and half using Jsoup (CSS selectors). This may not be the most efficient approach, since every page is read and parsed twice and the code is bloated, but it will do for now. I'm still keen to hear if there's a better solution. — PTS Admin, Aug 23 '12 at 16:07
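
To illustrate the encoding point raised in the comments, here is a minimal sketch using only the Java standard library. It assumes (the thread does not confirm this) that the files are UTF-8 and were being decoded as a single-byte charset such as ISO-8859-1, which would turn the three bytes of ’ into â followed by two unprintable characters, much like the "Abcâ??d" output above. The class name `EncodingDemo` and the helper `decode` are hypothetical names made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    // Decode raw bytes with an explicitly chosen charset rather than
    // relying on the platform default or a parser's guess.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Abc’d" (U+2019, right single quotation mark) as UTF-8 bytes.
        // In practice these bytes would come from the file on disk,
        // e.g. via java.nio.file.Files.readAllBytes(path).
        byte[] raw = "Abc\u2019d".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in keeps the character.
        String intact = decode(raw, StandardCharsets.UTF_8);

        // Decoding with the wrong charset maps each of the three bytes
        // of U+2019 to a separate, unrelated character.
        String garbled = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(intact.equals("Abc\u2019d"));
        System.out.println(intact.length() + " vs " + garbled.length());
    }
}
```

If this is indeed the cause, the fix is likely to tell the parser which charset to use rather than to switch parsers. Jsoup, for instance, has a `Jsoup.parse(File, String)` overload whose second argument is the charset name; whether "UTF-8" is the right value depends on how the files were actually saved.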
