Odd behavior from mb_strlen when calling it through two functions

Question

I often have to strip accents from strings, so I wrote a function, called accent(), to manage this more effectively. It was working well, but I recently ran into some characters that didn't get parsed correctly. This turned out to be an encoding issue (what else?) so I totally rewrote my code... and now I'm running into a new issue.

When I call the function directly, it works fine. However, when it is called from within another function, it breaks.

The second function, makesortname(), handles the creation of sort names. It does a bunch of stuff, then runs the result through accent() to strip any accents.

As an example, I'll take the name "Ekrem Ergün". Running it through makesortname() is supposed to return "ErgünEkrem" which then should become "ErgunEkrem" after using accent().

My accent() function uses mb_strlen() then runs each character in the string against a table to check for accents. If I print out each character to test it out, I'm noticing that mb_strlen is only reporting 5 characters instead of 10 and that 'ünEkre' is being treated as ONE character (which explains why the accent is not being stripped, as it's checking for that string instead of just 'ü').
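For reference, the difference between byte length and character length for this name can be checked in isolation. This is only a minimal sketch, assuming the string really is UTF-8; the name is written with explicit byte escapes so the result does not depend on how the file is saved.

```php
<?php
// "ErgünEkrem" with ü spelled out as its UTF-8 bytes (0xC3 0xBC).
$name = "Erg\xC3\xBCnEkrem";

echo strlen($name), "\n";             // 11: counts bytes, and ü takes two
echo mb_strlen($name, 'UTF-8'), "\n"; // 10: counts characters

// The fourth character (index 3) should be ü on its own.
echo mb_substr($name, 3, 1, 'UTF-8') === "\xC3\xBC" ? "ok" : "mismatch", "\n"; // ok
```

If the counts differ from these on a real input, the string reaching accent() is probably not in the encoding the mb_ functions are told to expect.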

The problem seems to be the encoding argument I pass to mb_strlen(). Thing is, if I leave it out, the code doesn't always work, depending on the string. And in this specific case, removing it only fixes the string length; the ü still doesn't get replaced (even if I also remove the argument from mb_substr()).

Here's the code I'm using.

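function accent($term)
    {
    $orstr = $term;
    $str2 = $orstr;
    // The encoding name must be a quoted string; a bare utf8 is an undefined constant.
    $strlen = mb_strlen($orstr, 'UTF-8');
    for( $i = 0; $i < $strlen; $i++ )
        {
        $char = mb_substr($orstr, $i, 1, 'UTF-8');
        $noacc = '';

        // Look up the unaccented equivalent of this character, if there is one.
        $safechar = mysql_real_escape_string($char);
        $chkacc = mysql_db_query("Definitions","SELECT NoAcc_col FROM tbl_Accents WHERE Letr_col = '$safechar' ");
            while($row = mysql_fetch_object($chkacc))
                $noacc = $row->NoAcc_col;
            mysql_free_result($chkacc);

        if($noacc != '')    $newchar = $noacc;
        else                $newchar = $char;

        // Replace every occurrence of this character in the working copy.
        $str2 = str_replace($char, $newchar, $str2);
        }
    return $str2;
    }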

For full disclosure, I'll also include the makesortname() function, though I doubt it has anything to do with the problem...

function makesortname($nameN)
    {
    $nameN = dashnames($nameN);
    $wordlist = explode(' ', $nameN, 2);
    $wordc = count($wordlist);

    if($wordc == 1)             $nameS = $wordlist[0];
    if($wordc == 2)             $nameS = $wordlist[1] . $wordlist[0];

    // Strip spaces and punctuation in one pass, then spell out ampersands.
    $nameS = str_replace(array(' ', ',', ':', ';', '.', '-', "'", '"', '(', ')', ']', '[', '/'), '', $nameS);
    $nameS = str_replace("&", 'and', $nameS);
    // mb_strtolower() is multibyte-safe; strtolower() is not charset aware.
    $nameS = mb_strtolower(accent($nameS), 'UTF-8');

    return $nameS;
    }
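One way to test the accent stripping separately from the database is to swap the per-character query for an in-memory table. This is only a sketch, not the accent() function above: strip_accents_sketch() is a hypothetical name, and the four-entry map stands in for whatever tbl_Accents actually contains. It assumes the input is UTF-8.

```php
<?php
// Hypothetical helper: replaces accented characters using an in-memory map
// instead of one database query per character.
function strip_accents_sketch($str)
{
    // Illustrative subset only; keys are UTF-8 byte sequences.
    $map = array(
        "\xC3\xBC" => 'u', // ü
        "\xC3\xA9" => 'e', // é
        "\xC3\xA7" => 'c', // ç
        "\xC3\xB1" => 'n', // ñ
    );
    // With an array argument, strtr() replaces each key wherever it occurs.
    return strtr($str, $map);
}

echo strip_accents_sketch("Erg\xC3\xBCnEkrem"), "\n"; // ErgunEkrem
```

If this version returns the expected result for a string that accent() gets wrong, the problem is more likely in the data coming back from the table than in the string handling.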

Show code.... what other functions are you using? `substr()`? referencing bytes as `$string[0]`? — Mark Baker, Nov 05 '15 at 14:31
Ack. I made some mistakes in my testing. With some changes, I was able to assess this better, but the problem still isn't fixed. I'll edit my question and include the code. — asg, Nov 05 '15 at 14:38
Let's put it this way: there's some problem in your code. A description of what your code is *supposed* to do doesn't help us at all, we need to see the actual code in detail to help you. — deceze, Nov 05 '15 at 14:40
OK, I've edited my question (see last paragraph for an update on the problem) and included my code. — asg, Nov 05 '15 at 14:50
strtolower() isn't charset aware, so that won't help.... use mb_strtolower() to be multibyte safe — Mark Baker, Nov 05 '15 at 14:50
Unless you've defined `utf8` as a constant, you should be using it as a string `'UTF-8'` — Mark Baker, Nov 05 '15 at 14:52
@Mark Oh, I didn't know that. OK, I fixed both of those, thanks ;-) — asg, Nov 05 '15 at 14:55

score 0 · Accepted Answer · answered Nov 05 '15 at 17:21

So I managed to fix my own problem!

I wrote a new function to check the encoding of the string, which then allows me to use either strlen/substr() or mb_strlen/mb_substr() depending on the encoding.
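A rough sketch of that kind of check follows. It is not the exact function described above: to_utf8_sketch() is a hypothetical name, and it takes the alternative route of converting everything to UTF-8 up front so the mb_ functions can be used throughout. It also assumes that any input that is not valid UTF-8 is ISO-8859-1, which may not hold for other data.

```php
<?php
// Hypothetical helper: returns the string as UTF-8.
// Assumption: input that is not valid UTF-8 is ISO-8859-1 (Latin-1).
function to_utf8_sketch($str)
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid UTF-8, leave it alone
    }
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Erg\xFCnEkrem";       // ü as the single Latin-1 byte 0xFC
$utf8   = "Erg\xC3\xBCnEkrem";   // ü as the two UTF-8 bytes 0xC3 0xBC

echo to_utf8_sketch($latin1) === $utf8 ? "converted" : "unchanged", "\n"; // converted
echo mb_strlen(to_utf8_sketch($latin1), 'UTF-8'), "\n";                   // 10
```

Normalizing once at the entry point means the rest of the code only ever has to deal with one encoding.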

Additionally, there also was an encoding issue within my mysql table.

Now that all this has been fixed, the function works as expected.

Thanks for your help and contributions, everyone!
