Filter non-Alpha characters for multiple languages

Question

I want to write a regex to remove all non-alphabetic characters, as follows:

björn -> björn
Barry's -> barrys
Who? -> who
Cibé? -> cibé
I'd -> id
ice-cream -> icecream
No!!! -> no
[{brackets}] -> brackets
~inv3rse -> invrse

I also want to convert all characters to lowercase. How do I do this for all languages, or at least for European languages that use the Latin script?

I'm assuming the fourth example shouldn't have a question mark, but should the accent be there? — Michelle, Aug 07 '13 at 20:00
@Michelle The accent should be there but not the question mark, thanks! — Baz, Aug 07 '13 at 20:01
[This question](http://stackoverflow.com/questions/5436824/matching-accented-characters-with-javascript-regexes) may help you match accented characters - try adding `\u00C0-\u017F` to your character class (I haven't verified what chars that includes, however). — Michelle, Aug 07 '13 at 20:07
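
As a rough illustration of the range suggested in the comment above, here is a minimal sketch. The function name `stripWithRange` is hypothetical. One thing worth knowing about U+00C0 to U+017F is that it contains two non-letters, × (U+00D7) and ÷ (U+00F7), so the sketch splits the range around them. It still misses any letter outside that range and does not handle combining characters.

```javascript
// Hypothetical helper: keep ASCII letters plus U+00C0 to U+017F
// (the letters of Latin-1 Supplement and Latin Extended-A).
// U+00D7 (×) and U+00F7 (÷) lie inside that range but are not letters,
// so the range is split into three pieces that skip them.
function stripWithRange(str) {
  return str
    .toLowerCase()
    .replace(/[^a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F]/g, '');
}

console.log(stripWithRange('Cib\u00E9?'));  // cibé
console.log(stripWithRange('~inv3rse'));    // invrse
console.log(stripWithRange('3\u00D74'));    // empty string: digits and × are all removed
```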

score 3 · Answer 1 · answered Aug 07 '13 at 20:02

str.toLowerCase().replace(/[^a-z]/gi,'');

This converts the string to lowercase, then replaces every character that isn't an ASCII letter (a-z) with the empty string, effectively removing it. To keep other characters (such as e with an accent mark), add them to the character class in the regex.
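
An alternative to listing extra characters one by one is to match "any letter" directly. The following is a minimal sketch, assuming a JavaScript engine with ES2018 Unicode property escapes (a feature that postdates this answer); the function name `stripNonLetters` is hypothetical.

```javascript
// Hypothetical helper. \p{L} matches any Unicode letter;
// the `u` flag is required for \p{...} to be recognized.
function stripNonLetters(str) {
  return str.toLowerCase().replace(/[^\p{L}]/gu, '');
}

// The examples from the question.
const samples = [
  'björn', "Barry's", 'Who?', 'Cibé?', "I'd",
  'ice-cream', 'No!!!', '[{brackets}]', '~inv3rse',
];
for (const s of samples) {
  console.log(`${s} -> ${stripNonLetters(s)}`);
}
```

Note that `\p{L}` accepts letters from every script, not only Latin; if that matters, `\p{Script=Latin}` is the property to look at. The sketch also assumes accented letters arrive precomposed (NFC), since a separate combining mark is not a letter and would be stripped.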

gr3co

But this will convert "Cibé?" to "cib" rather than "cibé". – Baz Aug 07 '13 at 20:03
@Baz edit the regex to include the unicode for any additional characters you want. – gr3co Aug 07 '13 at 20:04
@gr3co But there are so many of these symbols. Icelandic, Danish, Norwegian, Irish and Scottish alone contain: ýþæöøåäáéíóúàèìòù – Baz Aug 07 '13 at 20:10
Manually enumerating all accented characters is cumbersome, error-prone, and will need to be revised each time the Unicode standard adds new accented characters. It also doesn't cover combining characters (e.g. LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT): the regex throws away the combining character but keeps the E. On the other hand, you do want to remove a combining character when it combines with a non-alphabetic character. – Raymond Chen Aug 07 '13 at 20:14
I'm not sure if `\w` covers these foreign characters, but you can try this out and tell me if it works: `str.toLowerCase().replace(/[^\w]/gi, '')`. Basically, `\w` covers all "word characters" (letters, digits, and the underscore). – gr3co Aug 07 '13 at 20:17
@gr3co I think it's just short for `[a-zA-Z0-9_]` – Baz Aug 07 '13 at 20:19
@Baz then yeah I guess the best way to do it is to manually add everything. Sorry man. – gr3co Aug 07 '13 at 20:21
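
The thread above settles on enumerating characters by hand, which leaves the combining-character case open. As a hedged sketch of another route, assuming an engine with `String.prototype.normalize` (ES2015) and Unicode property escapes (ES2018), both of which postdate this thread: normalize to NFC, then keep each letter together with any combining marks that follow it. A mark attached to a non-letter is dropped along with its base. The helper name `keepLettersWithMarks` is hypothetical.

```javascript
// Hypothetical helper, not from the thread.
function keepLettersWithMarks(str) {
  const units = str
    .normalize('NFC')          // compose e.g. E + U+0301 into a single letter where possible
    .toLowerCase()
    .match(/\p{L}\p{M}*/gu);   // a letter plus any combining marks directly after it
  return units ? units.join('') : '';
}

console.log(keepLettersWithMarks('E\u0301'));   // é (U+00E9)
console.log(keepLettersWithMarks('1\u0301a'));  // a (the mark on "1" is dropped)

// For comparison, \w is ASCII-only without the `u` flag:
// it drops é but keeps digits and underscores.
console.log('Cib\u00E9?'.toLowerCase().replace(/[^\w]/g, ''));  // cib
console.log('~inv3rse'.toLowerCase().replace(/[^\w]/g, ''));    // inv3rse
```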
