I am using the regular expression below to weed out any non-Latin characters. I found that if I use a string longer than 342 characters, the function fails, everything aborts, and the website connection is reset.

I narrowed it down to the \p{P} Unicode character property, which matches any punctuation character.

Does anyone know/see where the problem lies, exactly?

preg_match('/^([\p{P}\p{S}&\p{Latin}0-9]|\s)*$/u', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');

tchrist
KcYxA
  • hy are you eeding out all non-latin characters? Ho ould your text look if you removed all instances of certain characters from English text? – Greg Hewgill Jul 05 '10 at 01:43
  • @Greg : How's that 'w' key working for you? – Stephen Jul 05 '10 at 01:52
  • @Greg, I basically want people to use Latin characters only. It does the job except in the case where there are more than 342 characters. I'm not sure why. Thus the question. – KcYxA Jul 05 '10 at 02:11

1 Answer

If you're "weeding out" non-Latin characters, why not just do this:

preg_replace('/[^\p{Latin}]+/u', '', $s)
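A minimal sketch of what that call does, assuming PHP 7 or later (for the `\u{...}` escapes). The helper name and the sample string are made up for illustration. Note that the pattern also strips spaces, digits, and punctuation, since none of those belong to the Latin script.

```php
<?php
// Hypothetical helper name; it wraps the preg_replace() call shown above.
function stripNonLatinLetters(string $s): ?string
{
    // Removes every run of characters whose script is not Latin.
    // preg_replace() returns null if the PCRE engine reports an error.
    return preg_replace('/[^\p{Latin}]+/u', '', $s);
}

// Made-up input: "Café 42" followed by one Cyrillic letter.
// Only the four Latin letters survive.
echo stripNonLatinLetters("Caf\u{00E9} 42 \u{041F}"), "\n";
```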

EDIT: Okay, so you're trying to validate the input. I was going to say, use this:

preg_match('/^[\p{Latin}]+$/u', $s)

...but it turns out that only matches Latin letters. I was thinking of Java's undocumented shorthand, \p{L1}, which matches everything in the Latin1 (ISO-8859-1) character set, but in PHP you have to spell it out:

preg_match('/^[\x00-\xFF]+$/u', $s)
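
To make the difference between the two validation patterns concrete, here is a small sketch, assuming PHP 7 or later. The function names and the sample strings are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: true only if every character is a Latin-script letter.
function isLatinLettersOnly(string $s): bool
{
    return preg_match('/^[\p{Latin}]+$/u', $s) === 1;
}

// Hypothetical helper: true if every code point is in U+0000..U+00FF,
// i.e. the range covered by ISO-8859-1.
function isLatin1Range(string $s): bool
{
    return preg_match('/^[\x00-\xFF]+$/u', $s) === 1;
}

// Made-up sample inputs.
$samples = [
    "Caf\u{00E9}",                        // Latin letters only
    "Caf\u{00E9} 42!",                    // adds a space, digits, punctuation
    "\u{041F}\u{0440}\u{0438}",           // Cyrillic letters
];

foreach ($samples as $s) {
    printf(
        "%s: lettersOnly=%s latin1Range=%s\n",
        $s,
        var_export(isLatinLettersOnly($s), true),
        var_export(isLatin1Range($s), true)
    );
}
```

The second sample is where the two patterns disagree: the space, digits, and exclamation mark are not Latin letters, but they do fall inside the `\x00-\xFF` range.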
Alan Moore
  • Thank you. However, I would like to notify the user of the error, and I need the validation to fail in order for an error to occur. Thus the validation rule (the regular expression) needs to look for what 'correct' looks like. – KcYxA Jul 05 '10 at 04:29
  • Your suggestion worked. Then I tried to work backward from it to my own version, and it turns out the culprit was the parentheses and the "or" alternation, for whatever reason. So this worked as well: '/^[\p{P}\p{S}&\p{Latin}0-9\s]*$/u'. Thanks! – KcYxA Jul 06 '10 at 03:55
  • Oh yeah, I meant to suggest that. I knew it was gratuitously inefficient to put the `\s` in its own alternative and wrap the whole thing in a capturing group, but I wouldn't have expected it to go pear-shaped on such a small input. – Alan Moore Jul 06 '10 at 09:52
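
The single-character-class version from the comments can be sketched as a validation helper along these lines, assuming PHP 7 or later. The function name and the sample inputs are hypothetical. Checking for a `false` return is an addition not discussed in the thread: `preg_match()` returns `false` when the PCRE engine itself reports an error, which is different from a plain non-match.

```php
<?php
// Hypothetical helper name; the pattern is the one quoted in the comment above.
function isLatinInput(string $s): bool
{
    // One character class, no capturing group, no alternation.
    // The "&" is a literal ampersand, which \p{P} already covers.
    $result = preg_match('/^[\p{P}\p{S}&\p{Latin}0-9\s]*$/u', $s);

    if ($result === false) {
        // Engine-level failure (for example invalid UTF-8 or an exceeded limit).
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }

    return $result === 1;
}

// Made-up inputs: a string well past 342 characters, ordinary text, Cyrillic.
var_dump(isLatinInput(str_repeat('a', 1000)));
var_dump(isLatinInput('Hello, world! 42'));
var_dump(isLatinInput("\u{041F}\u{0440}\u{0438}"));
```

Surfacing the error code this way lets the caller distinguish "input rejected" from "the regex could not be evaluated", so the user can be shown the right message.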