2

The default encoding in Office Open XML is UTF-8. So Unicode is already possible. Nevertheless does Microsoft defining: ECMA-376 Part 1 22.4 Variant Types 22.4.2.4 bstr (Basic String):

22.4.2.4 bstr (Basic String)

This element defines a binary basic string variant type, which can store any valid Unicode character. Unicode characters that cannot be directly represented in XML as defined by the XML 1.0 specification, shall be escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be escaped as _x0008_. end example] To store the literal form of an escape sequence, the initial underscore shall itself be escaped (i.e. stored as _x005F_). [Example: The string literal _x0008_ would be stored as _x005F_x0008_. end example]

The possible values for this element are defined by the W3C XML Schema string datatype.

This extends the W3C XML Schema string datatype. So that the character sequence _xHHHH_ does have a special meaning as a kind of entity like &#xHHHH;. And that means that everyone who needs parsing Office Open XML (*.xlsx, *.docx, *.pptx) must bearing in mind this while parsing. For example if you put "Text _x1234_ text" into an Excel cell, then Excel does storing this as "Text _x005F_x1234_ text" in the XML. So the string stored in the file is different from the string which was entered and also is different from the string which Excel will showing in the cell. For example if you put "Text _x1234_ text" as string cell content into the XML, then Excel will showing "Text ሴ text" into the cell.

See: XSSFCell in Apache POI encodes certain character sequences as unicode character

It is clear to me that XML 1.0 does having some characters that cannot be directly represented in XML. But this are control characters and other users of XML are able fulfilling the restrictions without such extensions. They are using other properly defined encodings (Base64 for ex.) if content having control characters in it is needed.

So I am always nor looking for some useful use cases for this _xHHHH_ within a string.

Questions:

  1. Can someone enlighting me why this special Unicode numerical character representation escape character format _xHHHH_ in Office Open XML is necessary at all?

  2. Can someone giving any useful use cases for this _xHHHH_ within a string?

Axel Richter
  • 56,077
  • 6
  • 60
  • 87
  • I'm not sure how useful it is but there's an example [here](https://stackoverflow.com/questions/43094662/excel-accepts-some-characters-whereas-openxml-has-error/43141040#43141040) where somebody wanted to output `0x1f`. – petelids Jan 18 '18 at 09:26
  • @petelids: Thank you, but this is exactly what I mean as not useful. So you can forcing `Excel` having that kind of separator within cell content. But `Excel` is not able handling this kind of separator within cell content properly. So if someone tries editing that cell content in `Excel`, then this one will be confused by wild jumping the cursor and occurence of typed letters on unexpected positions. Then probably he will deleting the cell content and retyping. So all effort with this `_x001F_` will be lost. – Axel Richter Jan 18 '18 at 09:57
  • You seem to be asking a "why did they design it like this?" question. That's not the kind of question that SO is designed for, so I'm going to vote to close. It's also a very hard sort of question to answer, because you can never find out what criteria the designers were using when making their decisions: they are probably not the criteria you would use. – Michael Kay Jan 18 '18 at 11:52
  • I'm voting to close this question as off-topic because the question is asking why a software product was designed the way it is, rather than asking for a solution to a programming problem. – Michael Kay Jan 18 '18 at 11:54

1 Answers1

1

As an use case, our all DB is isolated as an requirement and we need to test some jobs/crons/webservices on different DB's, now we need to export some data in an excel and feed to the job as an input file for another DB to check if it's working as expected. Our architecture is required this due to some privileges restriction.

Hope it will be an useful case for you :)

Shubham Jain
  • 16,610
  • 15
  • 78
  • 125
  • 1
    Ok, having Excel files as data exchange file format... Nothing what I would do because there are multiple other issues with this, not only the control character problem. For ex.: Cell content like `00123` becomes `123` without leading zeros if Excel takes it as numeric. Cell content like `09/11/2001` becomes the date November 09 2001 if it is string content and Excel does interpreting it as date having D/M/Y as locale setting.... and so on and so forth. – Axel Richter Sep 06 '18 at 08:30