I have a problem with non-ASCII Latin characters (ž, š, đ, č, ć). Here is the code:

$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www', 'on', 'ona', 'ja');

$string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
$string = trim($string); // trim the string

$string = preg_replace('/[^a-zA-Z0-9žšđč掊ĐČĆ -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…

$string = mb_strtolower($string); // make it lowercase

preg_match_all('/\b.*?\b/i', $string, $matchWords);

$matchWords = $matchWords[0];

foreach ( $matchWords as $key=>$item ) {
    if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
        unset($matchWords[$key]);
    }
}

$wordCountArr = array();
if ( is_array($matchWords) ) {
    foreach ( $matchWords as $key => $val ) {
        $val = strtolower($val);
        if ( isset($wordCountArr[$val]) ) {
            $wordCountArr[$val]++;
        } else {
            $wordCountArr[$val] = 1;
        }
    }
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;

When I return $matchWords[0] right after this call:

preg_match_all('/\b.*?\b/i', $string, $matchWords);

I get this string (the array imploded with spaces):

ti si mi znaj na srcu kvar znaj znaj znaj srcu ž urka

There is an unwanted space in "ž urka"; the word should be "žurka".

Mirza Delic
  • Can you please read my post? The call preg_match_all('/\b.*?\b/i', $string, $matchWords) is what inserts the space next to the Latin letters; the foreach loop after it then removes the one-letter value. – Mirza Delic Aug 26 '12 at 23:03
  • There is a space in "ž urka", but what was the string initially? (Please post $string.) – Dr.Molle Aug 26 '12 at 23:15
  • The input is **Ti si mi, znaj, na srcu kvar, znaj znaj znaj srcu. žurka žšđčć ŽŠĐČĆ :)** and the output is **ti si mi znaj na srcu kvar znaj znaj znaj srcu ž urka** when the array is imploded with spaces (this is right after preg_match_all; everything works up to that point). – Mirza Delic Aug 26 '12 at 23:19

1 Answer

From the docs: A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.

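Without the u modifier, ž is not treated as a word character, so it (together with the space before it) matches \W, while the u that follows matches \w. A word boundary therefore falls between them, and you get ž and urka as separate matches.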
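
To see the difference in isolation, here is a minimal sketch. It assumes the source file and the string are UTF-8 encoded and that the default C locale is in effect; the variable names are only for illustration.

```php
<?php
// Assumption: this file and the string are UTF-8 encoded.
$word = 'žurka';

// Byte mode: "ž" is two bytes (0xC5 0xBE), and neither byte is \w
// under the default C locale, so only the ASCII tail is matched.
preg_match_all('/\w+/', $word, $byteMode);

// UTF-8 mode: "ž" is read as one letter and counts as a word character.
preg_match_all('/\w+/u', $word, $utf8Mode);

var_dump($byteMode[0]); // array containing "urka"
var_dump($utf8Mode[0]); // array containing "žurka"
```

The byte-mode result is exactly the "urka" fragment from the question; the leading ž ends up on the other side of a word boundary.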

These characters at the end will not match the pattern:

 žšđčć ŽŠĐČĆ :)

Without the u modifier they are all \W characters, so they would need to be followed by a \w character for the second \b to match.
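
A quick way to check this is to look for any word boundary at all in that tail. The snippet below is only a sketch, again assuming UTF-8 input and the default C locale:

```php
<?php
// Assumption: UTF-8 source file, default C locale.
$tail = 'žšđčć ŽŠĐČĆ :)';

// Byte mode: no byte in the tail is \w, so no position is a word boundary.
$byteMode = preg_match('/\b/', $tail);

// UTF-8 mode: the first character is a letter, so the start of the string
// is already a word boundary.
$utf8Mode = preg_match('/\b/u', $tail);

var_dump($byteMode); // int(0)
var_dump($utf8Mode); // int(1)
```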

I guess you are looking for the u modifier, which makes PCRE treat the pattern and the subject as UTF-8. Try:

preg_match_all('/\b.*?\b/iu', $string, $matchWords);
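
Note that strlen() and strtolower() in the rest of the question's code also work on bytes, so a two-byte letter such as ž counts as length 2. As a rough sketch of a multibyte-safe version of the whole routine: topWords is a hypothetical helper name, not something from the question, and the sketch assumes UTF-8 input, the mbstring extension, and PHP 7 or later. It matches runs of letters and digits directly instead of relying on \b, which is a swapped-in approach rather than the original pattern.

```php
<?php
// Hypothetical helper (not from the original post).
// Assumptions: UTF-8 input, mbstring extension, PHP 7+.
function topWords(string $string, array $stopWords, int $limit = 10): array
{
    $string = mb_strtolower($string, 'UTF-8');

    // \p{L} matches any Unicode letter and \p{N} any Unicode digit;
    // the u modifier makes PCRE read the subject as UTF-8, not bytes.
    preg_match_all('/[\p{L}\p{N}]+/u', $string, $matches);

    $counts = array();
    foreach ($matches[0] as $word) {
        // mb_strlen counts characters; strlen would count bytes.
        if (mb_strlen($word, 'UTF-8') <= 3 || in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }

    arsort($counts); // highest count first
    return array_slice($counts, 0, $limit, true);
}

$input = 'Ti si mi, znaj, na srcu kvar, znaj znaj znaj srcu. žurka žšđčć ŽŠĐČĆ :)';
$result = topWords($input, array('the', 'and'));
print_r($result);
```

With the sample string from the comments, žurka stays in one piece and znaj comes out as the most frequent word.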
Dr.Molle