Is there a Java library to normalize a string by removing spaces and special characters and lowercasing all letters? For example, `S-cube Abc' Inc.` to `scubeabcinc`.

`s = s.replaceAll("\\W", "").toLowerCase();` – Sean Bright Oct 15 '19 at 22:46
3 Answers
There is `java.text.Normalizer`. Java holds text in Unicode, and é can be written either as a single Unicode code point or as two: an e followed by a zero-width combining acute accent (U+0301). Unicode normalisation is therefore very important, for instance for dictionaries and file names.
The `Normalizer` can decompose text into letters and accents (diacritical marks), and a regex `replaceAll` can then remove all the accents.
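As a minimal sketch of that decompose-then-strip idea: the class and method names below (`AccentStripper`, `stripAccents`) are made up for illustration, and the regex assumes that "accent" means any code point in the Unicode mark category (`\p{M}`).

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the JDK.
final class AccentStripper {

    // Decompose to NFD so that accents become separate combining marks,
    // then delete every code point in the Unicode "Mark" category.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Caf\u00e9 Cr\u00e8me" is "Café Crème" with precomposed letters.
        System.out.println(stripAccents("Caf\u00e9 Cr\u00e8me")); // prints "Cafe Creme"
    }
}
```

Letters that have no decomposition (for example ø or ß) pass through unchanged, so this is not a general transliteration.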
`Character` has Unicode support: it gives the Unicode names of code points and classifies code points as letters, digits, members of a particular script, et cetera.
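For instance, the classification methods on `Character` can be queried per code point. This is only a sketch; `CodePointInfo` and `describe` are hypothetical names chosen for the example.

```java
// Hypothetical helper, not part of the JDK.
final class CodePointInfo {

    // Builds a one-line summary from the JDK's own Unicode tables.
    static String describe(int codePoint) {
        return Character.getName(codePoint)
                + " | letter=" + Character.isLetter(codePoint)
                + " | digit=" + Character.isDigit(codePoint)
                + " | script=" + Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00E9)); // U+00E9, e with acute accent
        System.out.println(describe('7'));
        System.out.println(describe('.'));
    }
}
```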
There is `Collator`, which is locale-oriented and creates locale-specific keys for words for ordering; it can be used as a `Comparator`. In one locale the order could be AaBbCcĉD.. and in another ABC...abc and such. `Locale` also determines the result of `toUpperCase` and `toLowerCase`. For instance, Turkish has a dotless i (I/ı) and a dotted i (İ/i).
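A short sketch of both points, using `java.text.Collator` as the comparator and the locale-taking overloads of `toLowerCase`/`toUpperCase`. The class name `LocaleAwareText` and its methods are hypothetical, and the exact sort order depends on the collation rules shipped with the JDK, so no order is promised here.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of the JDK.
final class LocaleAwareText {

    // Returns a sorted copy using the collation rules of the given locale.
    static String[] sortFor(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    // Lowercases with an explicit locale instead of the JVM default.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Collation groups a/A before b/B; natural String order would put A, B first.
        System.out.println(Arrays.toString(sortFor(Locale.ENGLISH, "b", "A", "a", "B")));
        // In Turkish, "I" lowercases to the dotless U+0131 rather than "i".
        System.out.println(lower("TITLE", turkish));
        System.out.println(lower("TITLE", Locale.ROOT));
    }
}
```

When the result is meant as a machine key rather than display text, passing `Locale.ROOT` avoids surprises on machines whose default locale is Turkish.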
And then there is your use case: a reduction. There is, for instance, the Soundex algorithm (third party) for a sound-alike representation. Regex can remove punctuation et cetera with `String.replaceAll`.
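One possible shape for such a reduction, sketched under the assumption that letters and digits of any script should be kept and everything else dropped. `NameKey` and `reduce` are hypothetical names, and whether digits belong in the key is a choice the question does not settle.

```java
import java.util.Locale;

// Hypothetical helper, not part of the JDK.
final class NameKey {

    // Keep Unicode letters (\p{L}) and numbers (\p{N}); remove spaces,
    // punctuation and symbols; lowercase independently of the default locale.
    static String reduce(String input) {
        return input.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(reduce("S-cube Abc' Inc.")); // prints "scubeabcinc"
    }
}
```

Accented letters survive this step; if they should collapse to their base letters, the text would first have to be decomposed with `Normalizer` and its combining marks removed.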

You have summed it up quite well. OP should clarify his use case a little more; he probably needs to handle all this complexity. – Subir Kumar Sao Oct 15 '19 at 23:18
No library is needed beyond `String` itself: `String.replaceAll` and `String.toLowerCase` do what you're looking for:

    String s = "S-cube Abc' Inc.";
    s = s.replaceAll("[^a-zA-Z]", "").toLowerCase();

No library is needed. Just use a regex and `String#toLowerCase`:

    String s = "S-cube Abc' Inc.";
    s = s.replaceAll("[^a-zA-Z]", "");
    s = s.toLowerCase();
    System.out.println(s);
