I have several hundred old HTML files on my machine that I am trying to parse to extract some data. I have tried several Java parsers for this, including Jsoup, TagSoup, HtmlCleaner and JTidy. Because of the way the HTML in these files is structured, I can only use parsers that support XPath; I tried Jsoup but couldn't find an equivalent CSS selector.
Anyway, my problem is that whichever parser I try cleans up the actual content and converts characters such as ' (apostrophe) into weird characters.
Is it possible to parse the content with any Java parser without tidying it and without replacing the special characters?
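
One possibility worth ruling out is that the apostrophes are damaged when the bytes are decoded, before any parser cleanup runs. Old HTML files often store a curly apostrophe as the single byte 0x92 (windows-1252), and reading that byte as UTF-8 produces a replacement character. The sketch below uses only the JDK; the class name `CharsetCheck`, the helper `decode` and the choice of windows-1252 are assumptions for illustration, not something known about these particular files.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCheck {
    // Decode raw bytes with an explicitly chosen charset
    // instead of relying on the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "it's" with the apostrophe stored as byte 0x92,
        // which is how windows-1252 encodes U+2019 (assumed encoding).
        byte[] raw = {'i', 't', (byte) 0x92, 's'};

        String right = decode(raw, Charset.forName("windows-1252"));
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Print the code point found where the apostrophe should be.
        System.out.println(Integer.toHexString(right.charAt(2))); // 2019
        System.out.println(Integer.toHexString(wrong.charAt(2))); // fffd
    }
}
```

If the second result matches what shows up in the parsed output, the fix may be as simple as passing the right charset when loading each file, for example through `Jsoup.parse(File, String charsetName)`, rather than switching parsers.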