0

Is there a Java library that normalizes a string by removing spaces and special characters and lowercasing all letters, for example turning S-cube Abc' Inc. into scubeabcinc?

Sean Bright
coderz

3 Answers

6

There is java.text.Normalizer. Java holds text as Unicode, and é can be written either as a single code point (U+00E9) or as two: an e followed by a zero-width combining acute accent (U+0301). Unicode normalisation matters whenever strings are compared or looked up, for example in dictionaries and file names. The Normalizer can decompose text into base letters and accents (diacritical marks), after which a regex replaceAll can remove all the accents.
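A minimal sketch of that decompose-then-strip idea, assuming the goal is plain base letters. The class and method names (AccentStripper, stripAccents) are made up for illustration; Normalizer.normalize and the \p{M} regex category are standard Java.

```java
import java.text.Normalizer;

class AccentStripper {
    // Decompose to NFD so accents become separate combining marks,
    // then delete every code point in the Unicode "Mark" category.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Precomposed é (U+00E9) and e + combining acute (U+0301) give the same result.
        System.out.println(stripAccents("caf\u00e9"));   // prints "cafe"
        System.out.println(stripAccents("cafe\u0301"));  // prints "cafe"
    }
}
```

Letters that have no canonical decomposition, such as ø, pass through unchanged.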

The Character class has Unicode support: it gives Unicode names to code points and classifies code points as letters, digits, members of particular scripts, and so on.

There is java.text.Collator, which is Locale-oriented. It creates locale-specific keys for words so they can be ordered, and it can be used as a Comparator. In one locale the order could be AaBbCcĉD... and in another ABC...abc, and so on. The Locale also determines the result of toUpperCase and toLowerCase. For instance, Turkish has both a dotless i and a dotted i: ı uppercases to I, and i uppercases to İ.
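A short sketch of both locale effects, assuming the standard java.text.Collator and String.toLowerCase(Locale). The class name LocaleDemo and the helper primaryEqual are hypothetical, chosen only for this example.

```java
import java.text.Collator;
import java.util.Locale;

class LocaleDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // At PRIMARY strength a Collator compares base letters only,
    // so case differences do not count.
    static boolean primaryEqual(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Locale.ROOT gives locale-neutral case mapping.
        System.out.println("TITLE".toLowerCase(Locale.ROOT));  // prints "title"
        // In Turkish, I lowercases to the dotless ı (U+0131).
        System.out.println("TITLE".toLowerCase(TURKISH));      // prints "tıtle"
        System.out.println(primaryEqual("abc", "ABC", Locale.ENGLISH)); // prints "true"
    }
}
```

For a normalized key that must not vary between machines, passing an explicit Locale such as Locale.ROOT avoids depending on the default locale.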

And then there is your use case: a reduction. There is, for instance, the Soundex algorithm (available in third-party libraries) for a sound-alike representation. A regex can remove punctuation and similar characters with String.replaceAll.
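Putting the pieces together, a reduction for the question's example could be sketched as follows. This assumes that only letters should survive (digits are dropped too) and that accented letters should collapse to their base letter; the names Reducer and reduce are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

class Reducer {
    static String reduce(String input) {
        // 1. Split accented letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Remove everything that is not a letter: marks, spaces, punctuation, digits.
        String lettersOnly = decomposed.replaceAll("\\P{L}+", "");
        // 3. Lowercase with a fixed locale so the result does not depend on the machine.
        return lettersOnly.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(reduce("S-cube Abc' Inc."));  // prints "scubeabcinc"
    }
}
```

Unlike a pattern such as [^a-zA-Z], the \P{L} class keeps non-ASCII letters, which may or may not be wanted depending on the data.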

Sean Bright
Joop Eggen
  • You have summed it up quite well. OP should clarify his use case a little more, probably he needs to handle all this complexity. – Subir Kumar Sao Oct 15 '19 at 23:18
0

No library is needed beyond the standard String class: String.replaceAll and String.toLowerCase do what you're looking for:

  String s = "S-cube Abc' Inc.";
  s = s.replaceAll("[^a-zA-Z]", "").toLowerCase();
Daniel Nguyen
0

No library is needed. Just use a regex with String#replaceAll, then String#toLowerCase:

String s = "S-cube Abc' Inc.";
s = s.replaceAll("[^a-zA-Z]", "");
s = s.toLowerCase();
System.out.println(s);
Cardinal System