The usual answer to most character-escaping questions is Apache's StringEscapeUtils (the commons-text version). Agreed, it can be used to escape the strings between the XML tags. But how do I escape the XML tag names themselves?

The characters allowed in a name are simple enough: https://www.w3.org/TR/xml11/#sec-common-syn

My use case is that I convert a database table into XML, where each column name becomes one XML tag name.

<ROW><COL1>Hello</COL1></ROW>

This works fine, but what if the column name is "/BIC/COL1"?

<ROW></BIC/COL1>Hello<//BIC/COL1></ROW>

is obviously not valid. Currently I do not even have a plan for what the encoding might look like. I would need to use a tag name like _x2FBIC_x2FCOL1 (0x2F being the code for "/") or something similar.

Am I overlooking anything?

Werner Daehn
  • Please [edit] your question to include the API you are using to create the XML document. Also add the current source code you have to build the XML document. – Progman Apr 28 '20 at 17:27
  • That would not help. I use Olingo v4 to create OData documents. Besides, the question is generic. No matter how you create the XML document, if the generator does not escape the XML tag names and allows any string, I have to encode it. In that sense, the source code would be out.println(""); – Werner Daehn Apr 29 '20 at 08:43
  • This is not a character encoding problem. And no, I don't agree that StringEscapeUtils from Apache is the answer for most encoding issues; I have never used it despite lots of escaping issues in many network protocols. Otherwise, @Progman's second suggestion is probably the way forward. – forty-two Apr 29 '20 at 21:54

1 Answer


XML has no escaping mechanism for element names. Some APIs will even reject the name of a new element when it doesn't match the specification for element names. There are at least two possible solutions to your problem:

  1. You can define your own escape mechanism, which you use to encode and decode the element name. For example, you could use _ as the escape character. The sequence __ (two underscores) stands for a literal _, and the sequence _XX or _uXXXX stands for the ASCII/Unicode character you want to write, given as hex digits.

  2. You can store the column name in an attribute instead of the element name. An attribute value can hold any string, and the XML API of your choice will write it with the proper escaping.
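
A minimal Java sketch of the first option, assuming the _uXXXX form only (one escape per UTF-16 code unit, always four hex digits). The class name XmlNameCodec and its methods are hypothetical and not part of any library. As a deliberate simplification, only ASCII letters, digits, "-" and "." pass through unescaped, even though the XML specification allows many more characters in names.

```java
/** Hypothetical helper: maps arbitrary strings to valid XML element names and back. */
final class XmlNameCodec {

    private XmlNameCodec() {}

    /** Encodes a non-empty string using "__" for '_' and "_uXXXX" for unsafe characters. */
    static String encode(String raw) {
        if (raw.isEmpty()) {
            throw new IllegalArgumentException("an XML name cannot be empty");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '_') {
                out.append("__"); // literal underscore
            } else if (isAsciiLetter(c) || (i > 0 && isPlainNameChar(c))) {
                out.append(c); // safe as-is; digits, '-' and '.' only after the first position
            } else {
                out.append(String.format("_u%04X", (int) c)); // escape one UTF-16 code unit
            }
        }
        return out.toString();
    }

    /** Reverses encode(); rejects escapes that encode() could not have produced. */
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c != '_') {
                out.append(c);
                i++;
            } else if (encoded.startsWith("__", i)) {
                out.append('_');
                i += 2;
            } else if (encoded.startsWith("_u", i) && i + 6 <= encoded.length()) {
                out.append((char) Integer.parseInt(encoded.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                throw new IllegalArgumentException("malformed escape at index " + i);
            }
        }
        return out.toString();
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    private static boolean isPlainNameChar(char c) {
        return isAsciiLetter(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }
}
```

With this sketch, XmlNameCodec.encode("/BIC/COL1") yields _u002FBIC_u002FCOL1, and decode returns the original column name. Because every escape starts with _, which is itself a legal first character of an XML name, the encoded result never begins with a digit or other forbidden start character.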

Progman
  • Good idea. I did both. When generating the OData metadata with the list of properties, I added an annotation so that for each property I know the column name it belongs to. And for the property name I use the __uXXXX pattern. – Werner Daehn Apr 30 '20 at 17:28