XSSFCell in Apache POI encodes certain character sequences as unicode character

Question

XSSFCell seems to encode certain character sequences as unicode characters. How can I prevent this? Do I need to apply some kind of character escaping?

e.g.

cell.setCellValue("LUS_BO_WP_x24B8_AI"); // The cell value now is „LUS_BO_WPⒸAI"

In Unicode Ⓒ is U+24B8

I've already tried setting an ANSI font and setting the cell type to string.

`apache poi` is not to blame for this. It is a `Microsoft` "feature". Seems as if Microsoft uses [VTBString class](https://msdn.microsoft.com/en-us/library/documentformat.openxml.varianttypes.vtbstring%28v=office.15%29.aspx) while parsing the `XML` to get the `Excel` string contents. So try `cell.setCellValue("LUS_BO_WP_x005F_x24B8_AI");`, where `_x005F_` is the underscore "Entity". This is exactly how `Excel` itself stores strings which contains `"_xHHHH_"`. — Axel Richter, Jan 12 '18 at 10:40
This is the way how Excel encodes unicode characters here and thus POI performs the same conversion as well, see https://bz.apache.org/bugzilla/show_bug.cgi?id=57008#c8 for the related issue and a longer explanation. — Matthias Gerth, Jan 12 '18 at 13:37
"This is the way how Excel encodes unicode characters here ": No, the `*.xlsx` Office Open XML default encoding is `UTF-8`. So no need for this special nonsense "encoding" at all. Surely `apache poi` will bearing in mind this while **reading**. But `cell.setCellValue("LUS_BO_WP_x24B8_AI");` does writing "LUS_BO_WP_x24B8_AI" into the file and not "LUS_BO_WPⒸAI". It is `Excel` what makes it to "LUS_BO_WPⒸAI". — Axel Richter, Jan 12 '18 at 14:00

score 0 · Answer 1 · answered Jan 12 '18 at 13:40

0

This character conversion is done in XSSFRichTextString.utfDecode()

I have now written a function that basicaly does the same thing in reverse.

private static final Pattern utfPtrn = Pattern.compile("_(x[0-9A-F]{4}_)");

private static final String UNICODE_CHARACTER_LOW_LINE = "_x005F_";

public static String escape(final String value) {
    if(value == null) return null;

    StringBuffer buf = new StringBuffer();
    Matcher m = utfPtrn.matcher(value);
    int idx = 0;
    while(m.find()) {
        int pos = m.start();
        if( pos > idx) {
            buf.append(value.substring(idx, pos));
        }

        buf.append(UNICODE_CHARACTER_LOW_LINE + m.group(1));

        idx = m.end();
    }
    buf.append(value.substring(idx));
    return buf.toString();
}

answered Jan 12 '18 at 13:40

Matthias Gerth

91
5

Maybe it is a solution for you. But not in general. Because now you distort the data. The file now contains "LUS_BO_WP_x005F_x24B8_AI" and all parsers other than `Excel` (LO/OO Calc for ex.) will get this and not "LUS_BO_WP_x24B8_AI". – Axel Richter Jan 12 '18 at 14:22
@AxelRichter When you type `LUS_BO_WP_x24B8_AI` into a cell in Excel, Excel changes it to `LUS_BO_WP_x005F_x24B8_AI`. So if you use LO/OO Calc or some other parser to open the file, and you have that problem, they do not handle Unicode literals in Microsoft format files the way Microsoft handles them. I would call that a bug in the other parser. – jmarkmurphy Jan 16 '18 at 12:21
@jmarkmurphy : "I would call that a bug in the other parser.": Definitely not. Because neither is that "feature" documented somewhere nor it is useful, at least not for Office Open XML, since the default encoding there is `UTF-8` and so all Unicode **is** possible already. But do convincing me by linking to a documentation for this "feature" possibly showing some useful use cases for this. – Axel Richter Jan 16 '18 at 16:04
@AxelRichter This behavior is documented in the Open Office XML specification. One place is in ECMA-376, Fourth Edition, Part 1 Section 22.4.2.4. In addition this document specifies that unicode characters that are not allowed in XML need to be encoded like _x####_. Looking at the referenced XML spec, there are some Unicode characters that are not a part of the XML spec most notably x1-x8, xB, xC, xE-x1F. There are also some others. These are mostly control characters. – jmarkmurphy Jan 17 '18 at 18:57
@jmarkmurphy: Oh, yes it really is documented there. Thank you for this information and for your effort. I am always not convinced though. This extends the string definition in `XML` by a proprietary (Microsoft only) meaning of `_xHHHH_` within a string. Other users of `XML` are able fulfilling the restrictions without such extensions. They are using other properly defined encodings (Base64 for ex.) if content having control characters in it is needed. So I am always nor looking for some useful use cases for this `_xHHHH_` within a string. – Axel Richter Jan 18 '18 at 05:01
@Matthias: You should extend the regexp at least by "a-f" so that it will also work for lower case words: `utfPtrn = Pattern.compile("_(x[0-9A-Fa-f]{4}_)");` – Jul 22 '21 at 21:23
The use case I currently have is that I want to put some URLs into an xlsx file and decide with some formula in the xlsx file which URL to use. Example URL is this: `https://[domain]/imgs/[idstuff]_kr__xfffd_mer-[idstuff].jpg`. With Matthias' escape method (extended by lower case letters) I can store that URL into a cell (let's say A1) and read the value from exactly that cell A1. BUT reading a referencing cell (B2 with formula `=A1`) returns `[...]kr_�mer[...]`. – Jul 22 '21 at 21:31
Update to my previous comment: Writing `abc_x005F_x005F_x005F_x005F_x005F_x005F_x005F_xfffd_def` into cell `A1` and reading from referencing cell B2 (with formula `=A1`) returns `abc_xfffd_def`, but reading the value from `A1` returns value `abc_x005F_x005F_x005F_xfffd_def`. So the result depends on where to read the value from. – Jul 23 '21 at 05:34

score 0 · Answer 2 · answered Jul 23 '21 at 06:08

Based on what @matthias-gerth suggested with little adaptations:

Create your own XSSFRichTextString class
Adapt XSSFRichTextString.setString like this: st.setT(s); >> st.setT(escape(s));
Adapt the constructor of XSSFRichTextString like this: st.setT(str); >> st.setT(escape(str));

Add this stuff in XSSFRichTextString (which is very near to Matthias suggestion):

 private static final Pattern PATTERN = Pattern.compile("_x[a-fA-F0-9]{4}");
 private static final String UNICODE_CHARACTER_LOW_LINE = "_x005F";

 private String escape(String str) {
     if (str!=null) {
         Matcher m = PATTERN.matcher(str);
         if (m.find()) {
             StringBuffer buf = new StringBuffer();
             int idx = 0;
             do {
                 int pos = m.start();
                 if( pos > idx) {
                     buf.append(str.substring(idx, pos));
                 }
                 buf.append(UNICODE_CHARACTER_LOW_LINE + m.group(0));

                 idx = m.end();
             } while (m.find());
             buf.append(str.substring(idx));
             return buf.toString();
         }
     }
     return str;
 }

XSSFCell in Apache POI encodes certain character sequences as unicode character

2 Answers2

Linked