
According to my sources who speak Spanish, if I am searching for the pattern "pan" in a list of strings which contains both these values:

$normalString = "abcpan123";
$specialString = "abcpañ123";

it should match both of them, e.g. strpos($normalString, "pan") and strpos($specialString, "pan") should both return 3.

However, only the first call returns a non-false value.

If I do a similar search in MySQL with LIKE '%pan%', it matches both strings.

Presumably this situation applies not only to n-tilde but also to other characters modified with accents and similar marks.

I am puzzled as to how to handle this ... it seems like a problem that others must have encountered and solved, but I haven't found a good existing solution. I was hoping for some different function in PHP, or a configuration of some sort, but no joy.

Certainly I could write some custom code with regular expressions instead of using strpos(), but I'm not even sure how to determine in multiple foreign languages which characters would be considered equivalent.

Any help for me?

phihag
tristan

1 Answer


strpos compares characters as they are, and "n" is simply not "ñ". In fact, it simply compares bytes and is not even aware of different encodings. If you want locale- and collation-aware comparison, use strcoll (note that it compares two whole strings rather than searching for a substring). Read the comments on its manual page as well; there isn't much documentation about it.
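To make the byte-level comparison concrete, here is a minimal sketch. It assumes PHP 7.0 or later (for the "\u{...}" escape, which always produces UTF-8 bytes regardless of how the source file is saved) and that the strings in question are UTF-8 encoded.

```php
<?php
// "\u{00F1}" is ñ (U+00F1), written as an escape so the bytes are UTF-8 by construction.
$normalString  = "abcpan123";
$specialString = "abcpa\u{00F1}123";

// In UTF-8, "n" is the single byte 6e, while "ñ" is the two bytes c3 b1.
echo bin2hex("n"), "\n";         // 6e
echo bin2hex("\u{00F1}"), "\n";  // c3b1

// strpos(haystack, needle) searches for the exact byte sequence of "pan".
var_dump(strpos($normalString, "pan"));   // int(3)
var_dump(strpos($specialString, "pan"));  // bool(false)
```

Since the bytes c3 b1 never equal 6e, no setting of strpos itself can make the second search succeed; one of the strings has to be transformed first.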

The database applies a collation setting out of the box, which is what makes it perform such fuzzy searches.

An alternative would be to normalize all strings to plain ASCII characters with iconv('UTF-8', 'ASCII//TRANSLIT', $string) before comparing them.
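A rough sketch of that normalization approach follows. The helper names to_ascii and strpos_folded are hypothetical, not PHP built-ins, and the locale name 'en_US.UTF-8' is an assumption about what is installed on the machine. The output of //TRANSLIT can vary with the iconv implementation and the current locale, so the result for the accented string is deliberately not shown.

```php
<?php
// Hypothetical helper: fold a UTF-8 string to ASCII using iconv transliteration.
function to_ascii(string $s): string
{
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $s);
    // iconv() returns false on failure; fall back to the unchanged input.
    return $ascii === false ? $s : $ascii;
}

// Hypothetical helper: run strpos() on the folded forms of both arguments.
// Returns an int offset into the folded haystack, or false if not found.
function strpos_folded(string $haystack, string $needle)
{
    return strpos(to_ascii($haystack), to_ascii($needle));
}

// Assumption: this locale exists on the system. Transliteration results
// may depend on LC_CTYPE, so it is set explicitly here.
setlocale(LC_CTYPE, 'en_US.UTF-8');

var_dump(strpos_folded("abcpan123", "pan"));         // int(3)
var_dump(strpos_folded("abcpa\u{00F1}123", "pan"));  // platform-dependent
```

Note that the returned offset refers to the folded string. Where transliteration changes the length of the text before the match, it will not line up with the byte offset in the original string.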

deceze
  • The iconv() call works great for me in a PHP command-line script, but it does not appear to do the correct thing when I am running PHP through Apache. Any ideas why? – tristan Mar 15 '12 at 20:51