Conversion from HTML to XHTML changes euro symbol, preventing correct XML parsing

Question

I am extracting information from an HTML file by parsing it with SAX in Java. The parsing program was given to me and already uses SAX, so I would like to keep it that way. What I do is the following:

1. I get the HTML file from a website.
2. I transform it into valid XML using the JTidy library. However, this library transforms all the € symbols into "â¬". The result is fileXHTML.
3. I feed fileXHTML to the parsing library so I can extract the data I want (I wrote the handlers: the functions startElement(), characters() and endElement()).

Problem: with that new string in place of the euro sign, the parsing library won't run. I get the message: "the entity acirc was referenced but not declared".

I just want my euro sign to not be a problem. How do I sort this out?

Thanks everyone,

score 1 · Answer 1 · answered Oct 21 '13 at 11:25

The issue you are having is one of encoding.

Some tool, somewhere in your pipeline, is mucking up the encoding, and then that error is carried forwards, creating an â in your output.

From the looks of it, the web site is using UTF-8 (as well it should), but the encoding is either misdeclared, or the declaration is ignored.

Whether it is one of the tools in your toolchain that causes this problem, or if it's misuse of the tools, is not entirely clear.
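
To make the diagnosis above concrete, here is a minimal sketch in plain Java (no JTidy involved) of how a euro sign written as UTF-8 turns into "â¬" when the bytes are read as ISO-8859-1. The class and method names are made up for illustration, and ISO-8859-1 is an assumption about which wrong charset is in play; the asker's pipeline may be using a different one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EuroMojibakeDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // Euro sign, written as an escape so the source file's own encoding cannot interfere.
        String euro = "\u20AC";

        // Correct round trip: written as UTF-8, read as UTF-8.
        String ok = reinterpret(euro, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Faulty round trip: the UTF-8 bytes E2 82 AC are read as three ISO-8859-1 characters.
        String broken = reinterpret(euro, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(euro));   // prints true
        System.out.println(broken.length());   // prints 3
        for (char c : broken.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+00E2, U+0082, U+00AC
        }
    }
}
```

The middle character, U+0082, is an invisible control character, which would explain why only "â¬" shows up. A tool that then writes non-ASCII characters as named entities would emit &acirc; for the first one, matching the error message in the question. If this is what is happening, the thing to check is which charset each step (the download, JTidy, the SAX parser) assumes when it reads and writes the file.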

Williham Totland

OK, thanks for the tip about encoding. How can I check whether the problem is in my toolchain, for example? In my HTML file, I have the following div tag in my body:
blabla
Is that normal? When I validate the HTML as XML, the validator reports an error about that string being in the middle of the document. – Myna Oct 21 '13 at 17:30
@Myna: Well, it looks like we found the culprit: The HTML source is bunk. – Williham Totland Oct 22 '13 at 06:22
Haha, yes. So how can I handle that? The idea would be to automate my crawling over many pages by following the hrefs in the start HTML page. How can I run my code if exceptions are raised every time there is invalid XHTML? Should I do things some other way? I just want to extract my data. – Myna Oct 22 '13 at 11:21
@Myna: When an XML parser encounters an error of that magnitude, the specification does not allow it to continue. As for HTML parsers, what exactly they do varies from parser to parser. In any case, that document should be considered irreparably broken and should be discarded. – Williham Totland Oct 22 '13 at 12:13
An XML parser is not suitable for the job unless you clean your input documents, which might be very tricky. Use an HTML parser as suggested above. You would want to go from HTML to XHTML, but as said, that is hard to do. – Ludovic Kuty Oct 25 '13 at 09:59
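
For the crawling scenario discussed in these comments, one possible approach (a sketch, not the asker's actual code) is to catch the parser's exception per page and move on, so that one broken document does not stop the whole run. The class SkipBrokenPages, the TextCollector handler and the sample strings are hypothetical; only the standard javax.xml.parsers and org.xml.sax APIs are used.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SkipBrokenPages {

    /** Hypothetical handler that simply accumulates all character data. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so append rather than assign.
            text.append(ch, start, length);
        }
    }

    /** Returns true if the document parsed cleanly, false if the parser gave up on it. */
    static boolean tryParse(String xhtml, DefaultHandler handler) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // A fatal error ends the parse; report failure so the caller can discard the page.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] pages = {
            "<p>12 &#8364;</p>",       // numeric reference: needs no declaration
            "<p>12 &acirc;&not;</p>"   // named HTML entities: undeclared in bare XML
        };
        for (String page : pages) {
            TextCollector collector = new TextCollector();
            if (tryParse(page, collector)) {
                System.out.println("parsed: " + collector.text);
            } else {
                System.out.println("skipped broken page");
            }
        }
    }
}
```

The second sample fails with the same kind of "entity was referenced but not declared" error as in the question, because a bare XML parser only knows the five predefined entities. The numeric reference &#8364; in the first sample needs no declaration.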

score 0 · Answer 2 · answered Oct 21 '13 at 11:22

Use the HTML number (the numeric character reference &#8364;) instead of the actual euro symbol.

SaturnsEye
