php preg_grep and umlaut/accent

Question

I have an array of terms, some of which contain accented characters. I run preg_grep like this:

$data= array('Napoléon','Café');
$result = preg_grep('~' . $input . '~i', $data);

So if the user types 'le', I would also want 'Napoléon' to be matched, which does not work with the above command.

I did some searching and found that this function might be relevant:

preg_match("/[\w\pL]/u",$var);

How can I combine these and make it work?

What does `var_dump($input);` give? You did not include that in your question. — hakre, Dec 28 '12 at 16:16
$input is what the user typed in; in the above case it is 'le'. — user1906418, Dec 28 '12 at 16:19
But those are two different strings. How should `le` match `lé`? They are different. Which rule should that follow? As for the two things you guessed could play a role, I don't think you can combine them at all. What made you think such a combination could be possible? Please also provide a reference for `\pL` in combination with the `preg_*` functions; what are you referring to here? — hakre, Dec 28 '12 at 16:25
You can find a similar function in Twitter Bootstrap typeahead (autocomplete). The basic idea is that even if the user types 'cafe', 'Café' should also be returned as a suggestion. — user1906418, Dec 28 '12 at 16:29
You can first replace all characters like é in both the user input and your data (in temporary variables; there cannot be many of them), and then your regexps will definitely work — user15, Dec 28 '12 at 16:31
@user1906418: Please link that similar function from twitter bootstrap. — hakre, Dec 28 '12 at 16:32
@user1906418: BTW, Twitter Bootstrap Typeahead does not support looking up *`Michigan`* by typing `í` - http://twitter.github.com/bootstrap/javascript.html#typeahead - Just noting. — hakre, Dec 28 '12 at 17:23

hakre · Accepted Answer · 2012-12-28T17:18:26.097

This is not possible with a regular expression pattern alone, because you cannot tell the regex engine to match all "e" and similar characters. However, it is possible to first normalize the input data (both the array and the search input), then search the normalized data but return the results from the non-normalized data.

In the following example I use transliteration to do this kind of normalization, which I guess is what you're looking for:

$data = ['Napoléon', 'Café'];

$result = array_translit_search('le', $data);
print_r($result);

$result = array_translit_search('leó', $data);
print_r($result);

The example output is:

Array
(
    [0] => Napoléon
)
Array
(
    [0] => Napoléon
)

As described above, the search function itself is rather straightforward: it transliterates the inputs, runs preg_grep, and then returns the matching original inputs:

/**
 * @param string $search
 * @param array $data
 * @return array
 */
function array_translit_search($search, array $data) {

    $transliterator = Transliterator::create('ASCII-Latin', Transliterator::REVERSE);
    $normalize      = function ($string) use ($transliterator) {

        return $transliterator->transliterate($string);
    };

    $dataTrans   = array_map($normalize, $data);
    $searchTrans = $normalize($search);
    $pattern     = sprintf('/%s/i', preg_quote($searchTrans, '/'));
    $result      = preg_grep($pattern, $dataTrans);
    return array_intersect_key($data, $result);
}

This code requires the Transliterator from the Intl extension; you can replace it with any other similar transliteration or translation function.
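
As one possible replacement, here is a minimal sketch that decomposes the string and strips combining marks instead of transliterating. The helper name `strip_marks` is hypothetical, and the sketch assumes UTF-8 input and the `Normalizer` class, which also ships with the Intl extension.

```php
<?php
// Hypothetical helper, not part of the answer's code.
function strip_marks(string $string): string
{
    // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($string, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $string; // normalization failed (e.g. invalid UTF-8): leave untouched
    }
    // \p{Mn} matches non-spacing marks; the u flag makes PCRE read the subject as UTF-8.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $string;
}

var_dump(strip_marks('Napoléon')); // string(8) "Napoleon"
```

Letters without a canonical decomposition, such as ø or ß, pass through this helper unchanged, so it covers fewer cases than a full transliteration.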

I cannot recommend str_replace here, by the way; if you need to fall back to a translation table, use strtr instead, which is what you're looking for. But I prefer a library that brings the translation with it, especially if it's the Intl library, which you normally can't beat.
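
If a translation table is the only option, the same search can be sketched with `strtr`. The function name `array_table_search` and the constant `ACCENT_MAP` are hypothetical, and the table is deliberately tiny; it would have to be extended for whatever characters the real data contains.

```php
<?php
// Hypothetical, incomplete table: extend it for the characters in your data.
const ACCENT_MAP = [
    'é' => 'e', 'è' => 'e', 'ê' => 'e',
    'ó' => 'o', 'ö' => 'o',
    'É' => 'E',
];

function array_table_search(string $search, array $data, array $map = ACCENT_MAP): array
{
    // strtr with an array translates every key in a single pass.
    $normalize = function (string $string) use ($map): string {
        return strtr($string, $map);
    };

    $dataTrans = array_map($normalize, $data);
    $pattern   = '/' . preg_quote($normalize($search), '/') . '/i';

    // preg_grep keeps the keys, so they can be used to pick the original strings.
    return array_intersect_key($data, preg_grep($pattern, $dataTrans));
}

print_r(array_table_search('cafe', ['Napoléon', 'Café']));
// Array
// (
//     [1] => Café
// )
```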

Thanks a lot for your detailed answer! On the strtr vs str_replace issue, aren't they doing the same thing? Is it because of performance or some other reason that strtr is better? — user1906418, Dec 28 '12 at 18:37
It's because those functions do different things. You want translation, not replacing, then replacing in the replaced string, and then replacing again in the already twice-replaced string. You just want the translation, not those multiple replacements. Performance-wise I have no clue. Probably `strtr` is faster, too, but who cares? — hakre, Dec 28 '12 at 18:43
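
The difference described in that comment can be seen with a small, contrived pair table (the values are chosen only for illustration):

```php
<?php
$pairs = ['a' => 'b', 'b' => 'c'];

// str_replace applies each pair in turn to the result of the previous one:
// "ab" -> "bb" (a => b) -> "cc" (b => c).
$replaced = str_replace(array_keys($pairs), array_values($pairs), 'ab');
var_dump($replaced); // string(2) "cc"

// strtr translates the original string once and never rescans its own output.
$translated = strtr('ab', $pairs);
var_dump($translated); // string(2) "bc"
```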

score 1 · Answer 2 · answered Dec 28 '12 at 16:50

You can write something like this:

$data = array('Napoléon','Café');
// do something with your input, but for testing purposes it will be simply as you wrote in your example
$input = 'le';

foreach($data as $var) {
  if (preg_match("/".str_replace(array("é"....), array("e"....), $input)."/i", str_replace(array("é"....), array("e"....), $var))) 
    //do something as there is a match
}

Actually, you don't even need a regex in this case; a simple strpos will be enough.
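
A minimal sketch of that idea follows. The `$map` table is a hypothetical stand-in for the elided `array("é"....)` lists, and two things are swapped in as assumptions: `stripos` instead of `strpos`, to keep the case-insensitivity of the `/i` flag, and `strtr` instead of `str_replace`, as suggested in the comments.

```php
<?php
// Hypothetical table standing in for the elided replacement lists.
$map   = ['é' => 'e', 'É' => 'E'];
$data  = ['Napoléon', 'Café'];
$input = 'le';

$needle  = strtr($input, $map);
$matches = array_filter($data, function ($var) use ($map, $needle) {
    // An empty needle is treated as "no match" here.
    return $needle !== '' && stripos(strtr($var, $map), $needle) !== false;
});

print_r($matches);
// Array
// (
//     [0] => Napoléon
// )
```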

user15


Thanks a lot. That is indeed a feasible solution. I am still looking to see if there is a way to do it in one regex line, though. – user1906418 Dec 28 '12 at 16:59
@user1906418 if you are going to use this solution, it is better to replace the characters in `$input` and keep using `preg_grep()` – nkamm Dec 28 '12 at 17:08
Do not use `str_replace` for the job, if you really need this, use [`strtr`](http://php.net/strtr). – hakre Dec 28 '12 at 17:16
