9

I want to remove all non-alphabetic characters from a string. The problem is that I don't know the letter range in advance, because it is a UTF-8 string.

It can be ENGLISH, ՀԱՅԵՐԵՆ, ქართული, УКРАЇНСЬКИЙ, РУССКИЙ

I usually do something like this:

$str = preg_replace('/[^a-zA-Z]/', '', $str);

or

$str = preg_replace('/[^\w]/u', '', $str);

but they both clear foreign characters.

Any ideas?

JBES
Mirko Akov

3 Answers

10

Use the Unicode character properties:

$str = preg_replace('/\P{L}+/u', '', $str);
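
As a minimal sketch of how this behaves on mixed-script input (the helper name `letters_only` and the sample string are made up for illustration, and PHP 7 or later is assumed):

```php
<?php
// Hypothetical helper: strips every code point that is not a Unicode letter.
function letters_only(string $s): string
{
    // \P{L} is the negation of \p{L} (any letter, any script).
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/\P{L}+/u', '', $s);

    // preg_replace() returns null on error, e.g. when $s is not valid UTF-8.
    return $result ?? '';
}

echo letters_only('ENGLISH 123, ՀԱՅԵՐԵՆ! ქართული? РУССКИЙ...'), PHP_EOL;
// prints: ENGLISHՀԱՅԵՐԵՆქართულიРУССКИЙ
```

Note that spaces are removed too, since a space is not a letter.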
Paul T. Rawkeen
Jocelyn
  • 1
    As a side note, it's worth mentioning the syntax for specifying a Unicode character class when the u flag is used: curly brackets are needed around the code points. For example, `[\x{0400}-\x{04FF}]` matches any character in the basic Cyrillic range. – cleong Aug 16 '12 at 15:06
  • How would the regex need to change to keep numbers as well as letters, instead of removing them? – Avatar Mar 03 '22 at 12:13
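
For the follow-up about numbers, one possible adjustment is to negate a class that contains both the letter and the number property. This is only a sketch, not part of the original answer; the helper name `letters_and_numbers` and the sample string are hypothetical, and PHP 7 or later is assumed:

```php
<?php
// Hypothetical helper: keeps Unicode letters and numbers, drops the rest.
function letters_and_numbers(string $s): string
{
    // \p{N} is any Unicode number. Use \p{Nd} for decimal digits only,
    // or 0-9 to keep ASCII digits only.
    return preg_replace('/[^\p{L}\p{N}]+/u', '', $s) ?? '';
}

echo letters_and_numbers('Київ-2022, ქართული №5!'), PHP_EOL;
// prints: Київ2022ქართული5
```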
8

UPDATE: For Unicode input, the pattern looks like this: [^\p{L}\s]+ (it keeps whitespace as well as letters).

Used with the u modifier, it removes all non-letter characters from a UTF-8 string.

  • \P{L}+ - matches any run of non-letter characters
  • \p{P}+ - matches any run of punctuation characters (use it to remove punctuation only)
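
The difference between these patterns can be sketched as follows (the sample string and variable names are made up for illustration):

```php
<?php
// Hypothetical sample: Latin "A" and Cyrillic "Б" mixed with digits and punctuation.
$str = 'A1, Б2!';

// Keep letters and whitespace; digits and punctuation are removed.
$lettersAndSpaces = preg_replace('/[^\p{L}\s]+/u', '', $str);
echo $lettersAndSpaces, PHP_EOL; // prints: A Б

// Remove punctuation only; letters, digits and whitespace stay.
$noPunctuation = preg_replace('/\p{P}+/u', '', $str);
echo $noPunctuation, PHP_EOL; // prints: A1 Б2
```

Keep in mind that \p{P} covers punctuation only. Symbols such as `$` or `+` belong to the separate \p{S} category and are left in place by the second pattern.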


Paul T. Rawkeen
1

The Unicode property for a letter is \pL; for a non-letter it is \PL.

$str = preg_replace('/\PL+/u', '', $str);
Toto