2

I'm working with JDOM at the moment. I can't think of a solution to what should essentially be an easy problem.

I have a valid XHTML string:

<b>M&amp;A</b> &euro;

How do I insert this into the XML DOM as follows?

<parentNode>
  <b>M&amp;A</b>
  €
</parentNode>

(this XML then goes off to an XSL transformer, which then renders XHTML for the browser)

I've come up with the following 'pseudo' solutions, but I'm not sure if they're possible:

Unescape the entities which aren't XML entities, then insert.
Re-escape only the XML entities, then HTML-unescape the entire string, then insert.

Taras

Trent

3 Answers

2

I guess you can use JTidy to transform named entities to numbered ones. After that, the XHTML is also valid XML.
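To illustrate what "named entities to numbered ones" means, here is a minimal sketch that does the rewrite by hand with a regular expression rather than with JTidy. The class name `NamedEntityRewriter` and its three-entry table are hypothetical; a real table would have to cover every XHTML entity, which is exactly the work JTidy is meant to save.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy or JDOM.
class NamedEntityRewriter {
    // Deliberately tiny, illustrative table (assumption: only these names occur).
    private static final Map<String, Integer> XHTML_ENTITIES = Map.of(
            "euro", 0x20AC,
            "nbsp", 0x00A0,
            "copy", 0x00A9);

    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String toNumericReferences(String xhtml) {
        Matcher m = NAMED_ENTITY.matcher(xhtml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = XHTML_ENTITIES.get(m.group(1));
            // Names missing from the table, including the XML built-ins
            // (amp, lt, gt, quot, apos), are copied through unchanged.
            String replacement =
                    (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <b>M&amp;A</b> &#8364;
        System.out.println(toNumericReferences("<b>M&amp;A</b> &euro;"));
    }
}
```

The result contains only the five predefined XML entities and numeric references, so any XML parser should accept it once it is wrapped in a root element.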

Tomalak
  • This is what I ended up doing: * Parse the input XHTML fragment as HTML into a DOM using JTidy * Extract all child nodes of body using XPath (/html/body/node()) * Insert the extracted nodes into the target XML DOM. The only caveat was that &apos; is a valid XHTML entity, yet not a valid HTML one. This meant that the first step wouldn't treat the sequence &apos; as an apostrophe, but rather as 6 individual characters. I fixed this by replacing all instances of &apos; with the numeric reference (bit of a hack, but it works) – Trent Jun 14 '09 at 12:16
  • I am sure there is a way to tell JTidy to replace all named entity references with numbered ones. On the command line this is "-n". There is also a switch to make it produce valid XML. I would think that the Java library can do the same thing. – Tomalak Jun 14 '09 at 12:27
  • Sorry, the spacing got a bit messed up above. I did find the -n property in JTidy; however, I couldn't find an option for it to parse XHTML instead of HTML. It parses the input as HTML, which means that it doesn't recognise the &apos; entity. I actually had a look at the source to see if I could add an entity, but no luck. In fact I found the source code responsible for defining the entities (EntityTable), and discovered that &apos; was not defined (the other 252 HTML entities were). – Trent Jun 15 '09 at 09:57
0

While &euro; is a valid XHTML entity, it is not a valid XML one.

Unfortunately, I don't know anything about JDOM, but if it is possible you may try adding DTD entity declarations like <!ENTITY euro "€">. And, maybe, put all XHTML tags into their proper namespace (<parentNode xmlns:x="http://www.w3.org/1999/xhtml"><x:b>...</x:b></parentNode>)
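A minimal sketch of the entity-declaration idea, using the JDK's built-in DOM parser rather than JDOM (an assumption on my part, since I can't vouch for the JDOM side). The class and method names are hypothetical, and only `euro` is declared here; every other XHTML entity you expect would need its own `<!ENTITY ...>` line in the internal subset.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper illustrating an internal DTD subset.
class EntityDeclarationSketch {
    static Document parseWithEuroEntity(String xhtmlFragment) {
        // The internal subset declares the one entity that XML lacks.
        String xml =
              "<!DOCTYPE parentNode [\n"
            + "  <!ENTITY euro \"&#8364;\">\n"
            + "]>\n"
            + "<parentNode>" + xhtmlFragment + "</parentNode>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalArgumentException("Fragment is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWithEuroEntity("<b>M&amp;A</b> &euro;");
        // With the default (entity-expanding) factory settings,
        // the text content is "M&A" followed by a space and the euro sign.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Because the entities are declared inline, the parser never has to fetch an external DTD.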

drdaeman
  • That solution was considered; however, we would have to do this for all possible HTML (XHTML?) entities - http://www.cookwood.com/html/extras/entities.html – Trent Jun 12 '09 at 09:35
0

Create a string containing

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>

+

your XHTML content, in this case <b>M&amp;A</b> &euro;

+

</html>

and then parse this string to obtain a document. Then get all the content inside the root element, that will be your XHTML content and place it inside your parentNode element. You may need to take into account that the content comes from a different document.
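The "different document" point can be sketched with the JDK's DOM API, where nodes must be imported before they can be attached to another document. This is only an illustration under assumptions: the class name is hypothetical, the DOCTYPE is left out, and the fragment is assumed to already use a numeric reference (`&#8364;`) so that it parses without any DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: copies parsed fragment nodes into a new parentNode.
class ImportFragmentSketch {
    static Element copyIntoParent(String wellFormedFragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Source document: the fragment wrapped in a throwaway root element.
            Document source = builder.parse(new InputSource(
                    new StringReader("<html>" + wellFormedFragment + "</html>")));

            // Target document with the parentNode element from the question.
            Document target = builder.newDocument();
            Element parent = target.createElement("parentNode");
            target.appendChild(parent);

            // importNode makes a deep copy owned by the target document;
            // appending the source nodes directly would fail with WRONG_DOCUMENT_ERR.
            for (Node child = source.getDocumentElement().getFirstChild();
                 child != null;
                 child = child.getNextSibling()) {
                parent.appendChild(target.importNode(child, true));
            }
            return parent;
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Element parent = copyIntoParent("<b>M&amp;A</b> &#8364;");
        // Two children are expected: the <b> element and a text node.
        System.out.println(parent.getChildNodes().getLength());
    }
}
```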

George Bina
  • I tried this approach and ran into a problem when parsing the string into a document: because &euro; is not an XML entity, the string essentially contains an unescaped ampersand, which is invalid XML. – Trent Jun 12 '09 at 09:31