
I'm dealing with a 3rd party API / Web Service and they only allow latin-1 character set in their XML. Is there an existing API / method that will find and replace all non-latin-1 characters in a String?

For example: Kévin

Is there any way to make that Kevin?

Gregg
  • So you don't just want to remove `Kévin`'s accents from the byte stream (as you'd see it in a simple text editor), but actually remove them from the XML infoset (as an XML parser reads it to an application) as well? – MvG Jun 27 '12 at 18:05
  • é is defined in Latin-1 (code point 233). Are you sure it's not ASCII you want? – Aleksander Blomskøld Jun 27 '12 at 18:09

2 Answers


Using ICU4J,

// uses com.ibm.icu.text.Normalizer from ICU4J
public String removeAccents(String text) {
    return Normalizer.decompose(text, false, 0)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

I found this example at http://glaforge.appspot.com/article/how-to-remove-accents-from-a-string

In Java 6, the necessary normalizer is built in as java.text.Normalizer.
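If you are on Java 6 or later and don't want the ICU4J dependency, the same approach can be sketched with the JDK's own java.text.Normalizer (the class name `AccentStripper` is just illustrative):

```java
import java.text.Normalizer;

public class AccentStripper {
    // NFD splits "é" into "e" + a combining acute accent; the regex
    // then strips the combining marks, leaving plain ASCII letters.
    public static String removeAccents(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAccents("Kévin")); // prints "Kevin"
    }
}
```

As noted in the comments, this only handles characters that decompose into a base letter plus combining marks; codepoints without such a decomposition pass through unchanged.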

bmargulies
  • A very good first start, at least for the given example. However there are no guarantees that the result will be ASCII-only, as there are Unicode codepoints which do not decompose into an ASCII character and combining diacritics. – MvG Jun 27 '12 at 18:06
  • If you really want to cope with all the obscure cases, you're going to have to write code. Possibly an ICU4J transcoder object. There's nothing I know of that handles all the weird possibilities, like turning ℃ DEGREE CELSIUS into C, or Ł LATIN CAPITAL LETTER L WITH STROKE into L. – bmargulies Jun 27 '12 at 18:18
  • Or detect unconvertible codepoints and respond appropriately, by removing them, aborting the operation, asking the user or whatever. – MvG Jun 27 '12 at 18:27
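The "detect unconvertible codepoints" idea from the last comment can be sketched with CharsetEncoder.canEncode, which reports whether a string is fully representable in the target charset (the class name `Latin1Checker` is illustrative, not from the original post):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Checker {
    // Returns true only if every character of the input can be encoded
    // in ISO-8859-1; the caller can then remove offending characters,
    // abort, or ask the user.
    public static boolean isLatin1Encodable(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Encodable("Kévin"));     // é is in Latin-1: true
        System.out.println(isLatin1Encodable("Bełchatów")); // ł is not: false
    }
}
```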

I've come across lots of posts on how to remove all accents. This (old!) post covers my use case, so I'll share my solution here. In my case, I only want to replace characters not present in the ISO-8859-1 charset. The use case is: read a UTF-8 file and write it to an ISO-8859-1 file, retaining as many of the special characters as possible (while preventing an UnmappableCharacterException).

@Test
void proofOfConcept() {
    final String input = "Bełchatöw";
    final String expected = "Belchatöw";
    final String actual = MyMapper.utf8ToLatin1(input);
    Assertions.assertEquals(expected, actual);
}

Normalizer seems interesting, but I only found ways to remove all accents.

public static String utf8ToLatin1(final String input) {
    return Normalizer.normalize(input, Normalizer.Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Weirdly, the above code fails in two ways at once: ł has no decomposition, so it passes through untouched, while ö is decomposed and loses its umlaut:

expected: <Belchatöw> but was: <Bełchatow>

CharsetEncoder seems interesting too, but it appears I can only set a static "replacement" (actually a byte array), so all unmappable characters become '?' or similar:

public static String utf8ToLatin1(final String input) throws CharacterCodingException {
    final ByteBuffer byteBuffer = StandardCharsets.ISO_8859_1.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE)
        .replaceWith(new byte[] { (byte) '?' })
        .encode(CharBuffer.wrap(input));
    // use limit(), not the whole backing array, to avoid trailing garbage
    return new String(byteBuffer.array(), 0, byteBuffer.limit(), StandardCharsets.ISO_8859_1);
}

Fails with

expected: <Belchatöw> but was: <Be?chatöw>

My final solution is thus:

public static String utf8ToLatin1(final String input) {
    final Map<String, String> characterMap = new HashMap<>();
    characterMap.put("ł", "l");
    characterMap.put("Ł", "L");
    characterMap.put("œ", "ö");
    final StringBuffer resultBuffer = new StringBuffer();
    final Matcher matcher = Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]").matcher(input);
    while (matcher.find()) {
        matcher.appendReplacement(resultBuffer,
            characterMap.computeIfAbsent(matcher.group(),
                s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "")));
    }
    matcher.appendTail(resultBuffer);
    return resultBuffer.toString();
}

A few points:

  1. The characterMap needs to be extended to your needs. The Normalizer handles accented characters, but you may have others. Also, extract characterMap out of the method (but beware: computeIfAbsent mutates the map, so watch out for concurrency!)
  2. Pattern.compile() shouldn't be called repeatedly; extract the compiled Pattern into a static field.
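Putting both points together, a refactored sketch might look like this (the ConcurrentHashMap and the static fields are my additions to the answer's code, not part of the original):

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyMapper {
    // Compiled once, not on every call (point 2).
    private static final Pattern NON_LATIN1 =
        Pattern.compile("[^\\p{InBasicLatin}\\p{InLatin-1Supplement}]");
    // Thread-safe cache, since computeIfAbsent mutates the map (point 1).
    private static final Map<String, String> CHARACTER_MAP = new ConcurrentHashMap<>();
    static {
        CHARACTER_MAP.put("ł", "l");
        CHARACTER_MAP.put("Ł", "L");
    }

    public static String utf8ToLatin1(final String input) {
        final StringBuffer result = new StringBuffer();
        final Matcher matcher = NON_LATIN1.matcher(input);
        while (matcher.find()) {
            // Map each non-Latin-1 character explicitly if known,
            // otherwise fall back to stripping its combining marks.
            matcher.appendReplacement(result,
                CHARACTER_MAP.computeIfAbsent(matcher.group(),
                    s -> Normalizer.normalize(s, Normalizer.Form.NFD)
                          .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

With this, `utf8ToLatin1("Bełchatöw")` yields `"Belchatöw"`: ł is replaced via the map, while ö (which is in Latin-1 Supplement) is left alone.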
Simon