I am extracting information from an HTML file by parsing it with SAX in Java. The parsing program was given to me and already uses SAX, so I would like to keep it that way. This is what I do:
- I download the HTML file from a website.
- I transform it into valid XML using the JTidy library and save the result as fileXHTML. However, the library turns every € symbol into "â¬".
- I feed fileXHTML to the SAX parser so I can extract the data I want (I wrote the handler methods startElement(), characters() and endElement()).
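For reference, here is a minimal sketch of the SAX step, using only the JDK's built-in parser. The class name `PriceExtractor`, the `<td>` target element and the sample document are hypothetical stand-ins for my real code, not the program I was given. It shows that when the input really is UTF-8 and the parser is told so, the euro sign reaches characters() untouched.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PriceExtractor {

    // Collects the text found inside <td> elements (hypothetical target element).
    static class CellHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private boolean inCell = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("td".equals(qName)) {
                inCell = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so append rather than assign.
            if (inCell) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("td".equals(qName)) {
                inCell = false;
            }
        }
    }

    // Parses the given bytes as XML in the stated encoding and returns the <td> text.
    static String extract(byte[] xhtml, String encoding) {
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xhtml));
            source.setEncoding(encoding); // tell the parser how to decode the bytes
            CellHandler handler = new CellHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign; the sample document is made up for illustration.
        String doc = "<html><body><table><tr><td>12 \u20AC</td></tr></table></body></html>";
        System.out.println(extract(doc.getBytes(StandardCharsets.UTF_8), "UTF-8"));
    }
}
```

So the SAX side seems able to handle the euro sign as long as it arrives as a correctly encoded character rather than as an undeclared named entity.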
Problem: with that replacement string in place of the euro sign, the SAX parser fails with the message "the entity acirc was referenced but not declared".
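My guess, and it is only a guess, is that this is an encoding mismatch: the page is UTF-8, but its bytes get decoded as a single-byte charset somewhere before or inside the tidy step. The sketch below uses only the JDK (no JTidy calls) and assumes ISO-8859-1 as the wrong charset; `EuroMojibake` and `misdecode` are names I made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EuroMojibake {

    // Encodes the text as UTF-8, then decodes those bytes with the wrong charset.
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // The euro sign is three bytes in UTF-8: E2 82 AC.
        String garbled = misdecode("\u20AC", StandardCharsets.ISO_8859_1);

        // Print the code points so invisible control characters show up too.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Under that assumption the single euro character becomes three characters, the first of which is â (U+00E2). That would explain why an `&acirc;` reference ends up in the output, and an XML parser with no declaration for that HTML entity then rejects it.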
I just want the euro sign to stop being a problem. How can I fix this?
Thanks, everyone.