-1

I have the text "Begünstigter" and I'm trying to escape the character 'ü' in it with StringEscapeUtils.escapeXml. As the code for 'ü' is &#252;, I would expect the method to return Beg&#252;nstigter. However, the escaping seems to be applied repeatedly until there is nothing left to escape: after producing Beg&#252;nstigter, the & is itself escaped as &amp;, so the final result I get is Beg&amp;#252;nstigter. I've tried commons-text, commons-lang and commons-lang3 with the escapeXml10 and escapeXml11 methods, as well as some other posted solutions, but nothing seems to work for me. What am I overlooking here, and how can I solve this issue?

Here is the full code of where I'm doing this:

private void exportRecords(XMLStreamWriter writer, XmlExportDataDescription exportDataDescription) throws XMLStreamException {
        Long companyId = exportDataDescription.getCompanyId();
        String mainTagName = exportDataDescription.getMainTagNameInXml();

        long count = 0;

        Clock clock = Clock.systemDefaultZone();
        writer.writeStartElement(mainTagName);
        while (true) {
            Map<String, Object> parameter = new HashMap<>();
            parameter.put("companyId", companyId);
            parameter.put("offset", count + 1);
            parameter.put("rowNum", count + MANUAL_XML_CREATION_BATCH_SIZE);

            long startTimeResults = clock.millis();
            List<Map<String, Object>> resultList = getSqlMapClientTemplate().queryForList("XML_EXPORT." + mainTagName, parameter);
            long endTimeResults = clock.millis();

            if (resultList.isEmpty()) {
                break;
            }

            log.debug("---- Retrieving " + resultList.size() + " results for table " + exportDataDescription.getMainTagNameInXml() + " took " + (endTimeResults - startTimeResults) + " ms");

            count += resultList.size();

            long startTimeBatchWriting = clock.millis();
            for (Map<String, Object> listEntry : resultList) {
                writer.writeStartElement(mainTagName + "_ROW");

                for (Entry<String, Object> entry : listEntry.entrySet()) {
                    if (entry.getKey().toLowerCase().equals("rn")) {
                        continue;
                    }

                    if (entry.getValue() == null) {
                        writer.writeEmptyElement(entry.getKey());
                    } else {
                        writer.writeStartElement(entry.getKey());
                        writer.writeCharacters(StringEscapeUtils.escapeXml(entry.getValue().toString()));
                        writer.writeEndElement();
                    }
                }

                writer.writeEndElement();
            }

            long endTimeBatchWriting = clock.millis();
            log.debug("---- Writing batch results for table " + exportDataDescription.getMainTagNameInXml() + " took " + (endTimeBatchWriting - startTimeBatchWriting) + " ms");
        }

        writer.writeEndElement();
        exportDataDescription.setNumberOfDatasets(BigDecimal.valueOf(count));
    }
Eda
  • Please provide a [mcve] - we can't tell whether you're doing something odd like calling `escapeXml` twice. – Jon Skeet Aug 30 '23 at 14:53
  • @JonSkeet just added my code – Eda Aug 30 '23 at 14:58
  • afaik that string doesn't need escaping for xml. If you were to do an html escape on it, you'd get `Beg&uuml;nstigter` – g00se Aug 30 '23 at 15:02
  • I think since you are using an XML writer to write the string, you don't need to escape it. But you didn't specify the library you are using. Your code is not an [mcve]. – RealSkeptic Aug 30 '23 at 15:06
  • @RealSkeptic I'm using `org.apache.commons.lang.StringEscapeUtils`. I also need to escape characters <, > in other exported fields. So I cannot simply remove this method. – Eda Aug 30 '23 at 15:13
  • That isn't a [mcve]. Ideally, you'd provide us a piece of code we can copy, paste, compile and run to see the problem. – Jon Skeet Aug 30 '23 at 15:25
  • *I also need to escape characters <, > in other exported fields* Those are `&lt;` and `&gt;`. If you escape a string with umlauts in it using `StringEscapeUtils`, you'll get the same string back, because they don't need escaping – g00se Aug 30 '23 at 15:27
  • Why do you think you need to escape anything before calling the `XMLStreamWriter` `writeCharacters` method? Don't you think the XMLStreamWriter knows how to properly escape XML on its own? Also, what is all this log, Clock, timing, SqlMap blah blah blah code? It doesn't have anything to do with your question. Create a minimal program that just writes the string in question to an XMLStreamWriter plus any escaping. That's a minimal, reproducible example. – David Conrad Aug 30 '23 at 15:37
  • Here's a minrep for the API you asked about: `public static void main(String[] args) throws Exception {String s = "Begünstigter";if (args.length > 0) {s = args[0];}System.out.println(StringEscapeUtils.escapeXml10(s));}` You'll get the same string out again with no arguments. But @DavidConrad is almost certainly right - if the writer doesn't know, unaided, that `<` needs escapement then I'd be very surprised – g00se Aug 30 '23 at 15:38
  • @g00se Yes. I just tested it and if the string is `"<Begünstigter>"` it produces a result like `&lt;Begünstigter&gt;`. No prior escaping, just calling `writeCharacters("<Begünstigter>")` directly. – David Conrad Aug 30 '23 at 15:47

2 Answers

3

Here is a minimal, reproducible example that shows escaping is not necessary before calling XMLStreamWriter::writeCharacters:

import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

StringWriter sw = new StringWriter();
XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
writer.writeStartDocument();
writer.writeStartElement("value");
writer.writeCharacters("<Begünstigter>");
writer.writeEndElement();
writer.writeEndDocument();
writer.close();
System.out.println(sw.toString());

You can run this in JShell; the output is:

"<?xml version=\"1.0\" ?><value>&lt;Begünstigter&gt;</value>"

In short, XMLStreamWriter already knows how to write XML. You do not need to, and should not, escape text before passing it to the writeCharacters method.
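
To illustrate why pre-escaping is harmful, here is a small sketch of what appears to be happening in the question. It assumes the JDK's built-in StAX implementation, and it uses the literal string Beg&#252;nstigter to stand in for whatever the prior escaping step returned; the class and method names are made up for the illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class DoubleEscapeDemo {

    // Hypothetical helper: writes <value>text</value> and returns the XML as a string.
    static String writeValue(String text) {
        try {
            StringWriter sw = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
            writer.writeStartElement("value");
            writer.writeCharacters(text); // the writer escapes markup characters itself
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return sw.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pre-escaped input (assumed output of an earlier escaping step):
        // the writer sees a literal '&' and escapes it again.
        System.out.println(writeValue("Beg&#252;nstigter"));
        // with the JDK's default implementation: <value>Beg&amp;#252;nstigter</value>

        // Raw input: nothing in it needs escaping.
        System.out.println(writeValue("Begünstigter"));
        // with the JDK's default implementation: <value>Begünstigter</value>
    }
}
```

So the "repeated" escaping is most likely not done by StringEscapeUtils at all: the second round comes from the writer, which treats the already-escaped text as plain character data.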

Note: some implementations might only escape the < (left angle bracket) and not the > (right angle bracket); the former is required to be encoded while the latter is optional, but the result will still be correctly encoded and will be parsed correctly by an XML parser.
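
If you want to convince yourself that the output is parsed back correctly whichever way the implementation treats >, a round trip is a quick check. This is only a sketch using the standard StAX reader and writer; the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class RoundTripDemo {

    // Hypothetical helper: writes the text into <value>...</value>,
    // parses the result again and returns the element text the parser sees.
    static String roundTrip(String text) {
        try {
            StringWriter sw = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
            writer.writeStartElement("value");
            writer.writeCharacters(text); // no manual escaping
            writer.writeEndElement();
            writer.flush();
            writer.close();

            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(sw.toString()));
            reader.nextTag();                       // move to the <value> start tag
            String parsed = reader.getElementText(); // unescaped text content
            reader.close();
            return parsed;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "<Begünstigter> & <VT>";
        // The parser should hand back exactly what was written.
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

The point of the check is that the escaped form is an implementation detail of the serialized file; what matters is that the parsed text equals the original string.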

David Conrad
  • I just tried `writer.writeCharacters` with `<VT><AT><AK><VK><FK>` on my local system and it returns `&lt;VT>&lt;AT>&lt;AK>&lt;VK>&lt;FK>`. – Eda Aug 30 '23 at 16:22
  • @Eda that is correctly escaped. '<' is _required_ to be escaped in xml but '>' is _optional_. It looks weird, but any valid xml parser should parse that correctly. – jtahlborn Aug 30 '23 at 17:25
-2

One way to handle it is to unescape the parts that you do not want escaped:

writer.writeCharacters(
     StringEscapeUtils.escapeXml(
       entry.getValue().toString()
     ).replaceAll("&amp;#(\\d+);", "&#$1;")
  );

This replaces each &amp; that starts a numeric character reference (such as &amp;#252;) with a plain &.

snaikar