
I am working on validating my commenting script, and I need to strip out all non-alphanumeric characters except those used in Western Europe.

My plan is to regex out all non-alphanumeric characters with:

preg_replace("/[^A-Za-z0-9 ]/", '', $string);

But so far that also strips out all the accented European characters and the £ sign, so "Café Rouge" becomes "Caf Rouge".

How can I add an array of European characters to the above regex?

The array is:

£, €, 
á, à, â, ä, æ, ã, å,
è, é, ê, ë,
î, ï, í, ì,
ô, ö, ò, ó, ø, õ,
û, ü, ù, ú,
ÿ,
ñ,
ß

I use UTF-8

SOLUTION:

$comment = preg_replace('/[^\p{Latin}\d\s\p{P}]/u', '', $comment);

and

$name = preg_replace('/[^\p{Latin}]/u', '', $name);

The $name pattern also removes digits, punctuation marks and spaces.
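To see what the two patterns above actually keep, here is a minimal sketch. The function names and sample strings are made up for illustration, and the file is assumed to be saved as UTF-8. One thing worth checking in your own tests: £ and € are currency symbols, not Latin letters or punctuation, so the comment pattern appears to drop them as well; adding `£€` inside the brackets should keep them.

```php
<?php
// Hypothetical wrappers around the two patterns from the solution.

// Keep Latin-script letters, digits, whitespace and punctuation.
function clean_comment(string $comment): string
{
    // preg_replace() returns null on invalid UTF-8 input, hence the fallback.
    return preg_replace('/[^\p{Latin}\d\s\p{P}]/u', '', $comment) ?? '';
}

// Keep Latin-script letters only (no digits, spaces or punctuation).
function clean_name(string $name): string
{
    return preg_replace('/[^\p{Latin}]/u', '', $name) ?? '';
}

// Invented sample input: the heart and the pound sign are both removed.
echo clean_comment("Café £5 ♥!"), "\n";
echo clean_name("José María-López 3"), "\n";
```

The `/u` modifier makes PCRE treat both the pattern and the subject as UTF-8, which is what lets `\p{Latin}` match "é" as one character instead of two bytes.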

Thanks for quick replies

Giacomo1968
Pringles
  • Do you just want to protect against SQL injection? That's a solved problem already. No need to restrict the input for it. - [The Great Escapism (Or: What You Need To Know To Work With Text Within Text)](http://kunststube.net/escapism/) – deceze Nov 27 '12 at 13:12
  • Protecting against injection is only one of the issues. I also want a limited amount of non alphanumeric chars, because I might reuse the titles for friendly links later on, and generally, because I don't like weird stuff coming into my DB. – Pringles Nov 27 '12 at 13:18
  • as someone who writes in non-western scripts and likes to decorate text with useful dingbats on occasion, i kind of resent having my typing called "weird stuff", and i would be really really irritated if a website silently deleted parts of something i wrote. – Eevee Jun 22 '14 at 23:36
  • @Eevee I know what you mean, but I had to do it in order to have friendly links. For example, a user can create a thread that can be accessed at www.example.co.uk/group_name. If a user calls his/her thread "Café", unfortunately the URL will look like this: example.co.uk/caf%3F. Because it is hard to display the letter "é" in most English-language URL bars, I ended up regex'ing it into the basic "e". – Pringles Jun 23 '14 at 10:05
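The friendly-link idea from the last comment (turning "é" into a plain "e" before building the URL) could be sketched roughly like this. `make_slug()` is a hypothetical helper, not code from the thread; the map only covers the lower-case characters listed in the question, and the file is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: builds a URL-friendly slug from a thread title.
function make_slug(string $title): string
{
    // Accented characters from the question mapped to ASCII look-alikes.
    // Upper-case accented letters are not covered; extend the map as needed.
    $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ä' => 'a', 'æ' => 'ae', 'ã' => 'a', 'å' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i', 'í' => 'i', 'ì' => 'i',
        'ô' => 'o', 'ö' => 'o', 'ò' => 'o', 'ó' => 'o', 'ø' => 'o', 'õ' => 'o',
        'û' => 'u', 'ü' => 'u', 'ù' => 'u', 'ú' => 'u',
        'ÿ' => 'y', 'ñ' => 'n', 'ß' => 'ss',
    ];

    // Replace mapped characters, then lower-case the ASCII letters.
    $ascii = strtolower(strtr($title, $map));

    // Collapse every run of anything else into a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);

    // Drop hyphens left over at either end.
    return trim($slug, '-');
}

echo make_slug('Café Rouge'), "\n"; // prints "cafe-rouge"
```

Keeping the original title in the database and storing the slug separately means the displayed text can stay "Café" while only the URL is simplified.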

2 Answers

preg_replace('/[^\p{Latin}\d ]/u', '', $str);
Ωmega
  • It seems to me that whitespace must also be kept (`/[^\p{Latin}\d\s]/u`). – piouPiouM Nov 27 '12 at 13:52
  • @piouPiouM - I am not sure whether OP wants to keep tabs and other whitespace characters or not. Some European names also contain `'` and `-` characters, so most likely the set of allowed characters will be adjusted by OP based on testing anyway... – Ωmega Nov 27 '12 at 13:56
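As one possible adjustment along the lines of the comments above, a name filter could allow the apostrophe and hyphen alongside Latin letters and the space. This is only a sketch: `clean_person_name()` and the sample input are invented, and whether digits or other whitespace belong in the class is a decision for your own testing.

```php
<?php
// Hypothetical variant of the answer's pattern for personal names.
function clean_person_name(string $name): string
{
    // Allowed: Latin-script letters, apostrophe, space and hyphen.
    // The hyphen sits last in the class so it is taken literally.
    return preg_replace('/[^\p{Latin}\' -]/u', '', $name) ?? '';
}

echo clean_person_name("Anne-Marie O'Neil★7"), "\n"; // prints "Anne-Marie O'Neil"
```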
echo preg_replace('/[^A-Z0-9 £€áàâä...]/ui', '', $string);

The important part is the /u flag. Make sure your source code and $string are UTF-8 encoded.

I still think it's the wrong approach, because it severely limits what your users can enter and it will annoy some, but whatever floats your boat... BTW, your list contains no punctuation characters.
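For completeness, the explicit-list approach could look roughly like the sketch below. The extra characters are copied from the list in the question rather than from the elided pattern above, and `keep_listed_chars()` is a hypothetical name. Both the source file and the input are assumed to be UTF-8.

```php
<?php
// Hypothetical helper: keeps ASCII letters, digits, spaces and the
// extra characters listed in the question; removes everything else.
function keep_listed_chars(string $s): string
{
    $extra = '£€áàâäæãåèéêëîïíìôöòóøõûüùúÿñß';

    // /u: read pattern and subject as UTF-8, so each accented character
    //     in the class is one character rather than a run of bytes.
    // /i: case-insensitive, as in the answer.
    // preg_replace() returns null on invalid UTF-8, hence the fallback.
    return preg_replace('/[^A-Z0-9 ' . $extra . ']/ui', '', $s) ?? '';
}

echo keep_listed_chars("Café Rouge £5 ♥"), "\n";
```

Unlike the `\p{Latin}` patterns, this version keeps £ and €, but as noted it also removes all punctuation unless you add it to the class.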

deceze
  • Well, the thing is, I want to keep the comments professional, hence 'café' should be café and not cafe; but also I'd rather strip other symbols like hearts, diamonds and such. So I thought an exclusive array would be more suitable than an inclusive one. – Pringles Nov 27 '12 at 15:33