I want to write a regex that removes all non-alphabetic characters, as follows:

björn -> björn
Barry's -> barrys
Who? -> who
Cibé? -> cibé
I'd -> id
ice-cream -> icecream
No!!! -> no
[{brackets}] -> brackets
~inv3rse -> invrse

It should also convert all characters to lowercase. How do I do this for all languages, or at least for European languages that use the Latin script?

Baz
  • I'm assuming the fourth example shouldn't have a question mark, but should the accent be there? – Michelle Aug 07 '13 at 20:00
  • @Michelle The accent should be there but not the question mark, thanks! – Baz Aug 07 '13 at 20:01
  • [This question](http://stackoverflow.com/questions/5436824/matching-accented-characters-with-javascript-regexes) may help you match accented characters - try adding `\u00C0-\u017F` to your character class (I haven't verified what chars that includes, however). – Michelle Aug 07 '13 at 20:07

1 Answer

str.toLowerCase().replace(/[^a-z]/gi,'');

This converts everything to lowercase, then replaces every character that isn't an ASCII letter (a-z) with the empty string, which removes it. To keep other characters (such as e with an accent mark), add them to the character class in the regex.
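As a rough sketch of what "adding characters to the regex" could look like, the snippet below widens the character class with the `\u00C0-\u017F` range suggested in the comments on the question, minus the multiplication and division signs that sit inside it. The function name `stripNonLetters` is made up for illustration, and the sketch assumes the input uses precomposed accented characters (one code point per letter); it has not been checked against every character in that range.

```javascript
// Hypothetical helper, not part of any library.
// Assumption: accented letters arrive precomposed (e.g. "é" as U+00E9).
function stripNonLetters(str) {
  // Lowercase first, so the class only needs to list lowercase ASCII letters.
  // U+00C0-U+017F covers Latin-1 Supplement letters and Latin Extended-A;
  // U+00D7 (multiplication sign) and U+00F7 (division sign) are skipped
  // because they are not letters.
  return str
    .toLowerCase()
    .replace(/[^a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F]/g, '');
}

console.log(stripNonLetters('Cib\u00E9?'));    // "Cibé?"
console.log(stripNonLetters("Barry's"));
console.log(stripNonLetters('~inv3rse'));
```

Letters outside that range (for example Greek or Cyrillic) would still be stripped, so this only addresses the Latin-script part of the question.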

gr3co
  • But this will convert "Cibé?" to "cib" rather than "cibé". – Baz Aug 07 '13 at 20:03
  • @Baz edit the regex to include the unicode for any additional characters you want. – gr3co Aug 07 '13 at 20:04
  • @gr3co But there are so many of these symbols. Icelandic, Danish, Norwegian, Irish and Scottish alone contain: ýþæöøåäáéíóúàèìòù – Baz Aug 07 '13 at 20:10
  • Manually enumerating all accented characters is cumbersome, error-prone, and will need to be revised each time the Unicode standard adds new accented characters. It also doesn't cover combining characters (e.g. LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT): the regex throws away the combining character but keeps the E. On the other hand, you do want to remove the combining character if it combines with a non-alphabetic character. – Raymond Chen Aug 07 '13 at 20:14
  • I'm not sure if `\w` covers these foreign characters, but you can try this out and tell me if it works: `str.toLowerCase().replace(/[^\w]/gi, '')`. Basically, `\w` covers all "word characters" (letters, digits, and the underscore). – gr3co Aug 07 '13 at 20:17
  • @gr3co I think it's just short for `[a-zA-Z0-9_]`. – Baz Aug 07 '13 at 20:19
  • @Baz then yeah I guess the best way to do it is to manually add everything. Sorry man. – gr3co Aug 07 '13 at 20:21
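
For readers on newer JavaScript engines, one possible way around manual enumeration is Unicode property escapes (`\p{L}` for letters, `\p{M}` for combining marks), which were standardized in ES2018 and so were not available when this thread was written. The sketch below is only an illustration under that assumption; `lettersOnly` is a made-up name. It tries to respect the point about combining characters raised above: a mark is kept when it follows a letter and dropped when it follows anything else.

```javascript
// Hypothetical helper; assumes an engine that supports the `u` flag
// together with \p{...} Unicode property escapes (ES2018 or later).
function lettersOnly(str) {
  // Lowercase, then compose to NFC so that "e" + U+0301 becomes a
  // single "é" wherever a precomposed form exists.
  const normalized = str.toLowerCase().normalize('NFC');
  // Keep each letter together with any combining marks that directly
  // follow it. Marks that follow a non-letter are never matched, so
  // they are dropped along with digits and punctuation.
  const kept = normalized.match(/\p{L}\p{M}*/gu);
  return kept === null ? '' : kept.join('');
}

console.log(lettersOnly('Cib\u00E9?'));     // "Cibé?"
console.log(lettersOnly('[{brackets}]'));
console.log(lettersOnly('E\u0301!'));       // E + combining acute accent
```

Note that `\p{L}` matches letters from every script, not only Latin, which may or may not be what you want.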