I've come across lots of posts on how to remove all accents. This (old!) post covers my use case, so I'll share my solution here. In my case, I only want to replace characters that are not present in the ISO-8859-1 charset. The use case is: read a UTF-8 file and write it to an ISO-8859-1 file, retaining as many of the special characters as possible while avoiding an UnmappableCharacterException.
@Test
void proofOfConcept() {
    final String input = "Bełchatöw";
    final String expected = "Belchatöw";
    final String actual = MyMapper.utf8ToLatin1(input);
    Assertions.assertEquals(expected, actual);
}
Normalizer seems interesting, but I only found ways to remove all accents.
public static String utf8ToLatin1(final String input) {
    return Normalizer.normalize(input, Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
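Weirdly, the above code not only fails, it does the opposite of what I want: ö, which ISO-8859-1 can represent, loses its umlaut, while ł, which it cannot represent, is left untouched (ł has no canonical decomposition, so NFD does not split it). The test fails with: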
expected: <Belchatöw> but was: <Bełchatow>
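To see why the Normalizer approach behaves this way, here is a minimal sketch that checks what NFD does to each of the two characters. The class name NfdDemo and the helper stripMarks are made up for illustration; the string literals use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.text.Normalizer;

final class NfdDemo {
    // Same approach as above: decompose, then drop the combining marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // U+00F6 (o with diaeresis) decomposes into "o" + U+0308, so it becomes two chars.
        System.out.println(Normalizer.normalize("\u00F6", Normalizer.Form.NFD).length()); // 2
        // U+0142 (l with stroke) has no canonical decomposition, so it stays one char.
        System.out.println(Normalizer.normalize("\u0142", Normalizer.Form.NFD).length()); // 1
    }
}
```

So the regex only ever gets a combining mark to remove for characters like ö, and the stroke in ł is never separated from its base letter.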
CharsetEncoder seems interesting, but it appears I can only set a static "replacement" character (actually: byte array), so all unmappable characters become '?' or similar:
public static String utf8ToLatin1(final String input) throws CharacterCodingException {
    final ByteBuffer byteBuffer = StandardCharsets.ISO_8859_1.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .replaceWith(new byte[] { (byte) '?' })
            .encode(CharBuffer.wrap(input));
    // Decode only the bytes actually written; the backing array may be larger than the limit.
    return new String(byteBuffer.array(), 0, byteBuffer.limit(), StandardCharsets.ISO_8859_1);
}
Fails with
expected: <Belchatöw> but was: <Be?chatöw>
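Even though the encoder cannot do per-character replacements, it can still tell you which characters would be unmappable, which is handy for deciding what to put in a mapping table. This is a rough sketch and not part of my solution; the class name UnmappableFinder and the method unmappable are hypothetical, and it relies on CharsetEncoder.canEncode(CharSequence).

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class UnmappableFinder {
    // Returns, in order, the code points of the input that ISO-8859-1 cannot encode.
    static String unmappable(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder sb = new StringBuilder();
        input.codePoints()
                // Test each code point on its own so surrogate pairs stay together.
                .filter(cp -> !encoder.canEncode(new String(Character.toChars(cp))))
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

For the test input, only the ł is reported; the ö is encodable and is left alone.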
My final solution is thus:
public static String utf8ToLatin1(final String input) {
    // Explicit mappings for characters that NFD normalization cannot decompose.
    final Map<String, String> characterMap = new HashMap<>();
    characterMap.put("ł", "l");
    characterMap.put("Ł", "L");
    characterMap.put("œ", "ö");
    final StringBuffer resultBuffer = new StringBuffer();
    // Match every character outside the two Unicode blocks that ISO-8859-1 covers.
    final Matcher matcher = Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]").matcher(input);
    while (matcher.find()) {
        // Use the explicit mapping if there is one; otherwise decompose and drop the accents.
        matcher.appendReplacement(resultBuffer,
                characterMap.computeIfAbsent(matcher.group(),
                        s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "")));
    }
    matcher.appendTail(resultBuffer);
    return resultBuffer.toString();
}
A few points:
- The characterMap needs to be extended to suit your needs. The Normalizer is useful for accented characters, but you might have others. Also, extract characterMap out of the method (beware: computeIfAbsent updates the map, so a shared map is a concurrency hazard!).
- Pattern.compile() shouldn't be called repeatedly; extract the compiled Pattern out to a static field.
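The two points above could be applied roughly like this. Treat it as a sketch under a few assumptions rather than a finished implementation: the class name Latin1Mapper and the constant names are made up, the map is made read-only and looked up with get() instead of computeIfAbsent() so it can be shared between threads, and Matcher.quoteReplacement() is added so that a mapping value containing '$' or '\' would not be interpreted by appendReplacement(). String literals use Unicode escapes to avoid depending on the source file encoding.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Latin1Mapper {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern NON_LATIN1 =
            Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]");
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Read-only after class initialization, so no synchronization is needed.
    private static final Map<String, String> CHARACTER_MAP;
    static {
        Map<String, String> m = new HashMap<>();
        m.put("\u0142", "l");      // l with stroke
        m.put("\u0141", "L");      // L with stroke
        m.put("\u0153", "\u00F6"); // oe ligature -> o with diaeresis, as in the original map
        CHARACTER_MAP = Collections.unmodifiableMap(m);
    }

    static String utf8ToLatin1(String input) {
        StringBuffer result = new StringBuffer();
        Matcher matcher = NON_LATIN1.matcher(input); // Matcher is per call, not shared
        while (matcher.find()) {
            String ch = matcher.group();
            String replacement = CHARACTER_MAP.get(ch);
            if (replacement == null) {
                // Fall back to decomposing the character and dropping its accents.
                replacement = COMBINING_MARKS
                        .matcher(Normalizer.normalize(ch, Normalizer.Form.NFD))
                        .replaceAll("");
            }
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

Note that a character with neither a map entry nor a decomposition still passes through unchanged, so the map has to cover whatever shows up in your data.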