1

I've got a somewhat messy database containing the names of many institutions around the world.

I want to display them including national characters, but without invalid characters (those that Firefox displays as Unicode numbers).

How can I filter them out?

The database uses utf8 encoding, but some strings were inserted with the wrong encoding or were already garbled in the sources.

I do not want to fix the database; it's too big. I just want to filter the bad characters out: "out of sight, out of mind".

hakre
Jacek Kaniuk

3 Answers

7

I want to just filter it out

Your data has no reliably specified encoding/charset, which is a huge problem.

You can first try to convert it into utf-8 and then strip all non-printable characters:

$str = iconv('utf-8', 'utf-8//ignore', $str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);

The problem is that the iconv function can only try. With //ignore it is supposed to drop any invalid character sequence. As of PHP 5.4, however, it drops the complete string if the input is not valid in the specified input encoding.

PHP 5.3 and later also report that the input string has an invalid encoding.
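A minimal sketch of guarding that call, so a failed conversion is at least detectable. The helper name `iconv_ignore_invalid` and the choice to fall back to an empty string are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical helper; the name and the '' fallback are assumptions.
function iconv_ignore_invalid($str)
{
    // @ silences the message iconv raises on illegal input.
    $converted = @iconv('UTF-8', 'UTF-8//IGNORE', $str);

    // Depending on the PHP/iconv version, the call may return false
    // instead of the cleaned string, so check before using the result.
    return $converted === false ? '' : $converted;
}

// Valid input passes through unchanged.
echo iconv_ignore_invalid("Universit\xC3\xA4t"), "\n";
```

Because the fallback discards the whole string, this only shows why iconv alone is not enough for data like this.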

You can work around this by removing all invalid utf-8 byte sequences first:

$str = valid_utf8_bytes($str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);

/**
 * Get valid UTF-8 byte sequences.
 *
 * Copies every matching byte sequence; an invalid sequence is dropped up to
 * and including the first non-matching byte.
 * 
 * @param string $str
 * @return string
 */
function valid_utf8_bytes($str)
{
    $return = '';
    $length = strlen($str);
    $invalid = array_flip(array("\xEF\xBF\xBF" /* U+FFFF */, "\xEF\xBF\xBE" /* U+FFFE */));

    for ($i=0; $i < $length; $i++)
    {
        $c = ord($str[$o=$i]);

        if ($c < 0x80) $n=0; # 0bbbbbbb
        elseif (($c & 0xE0) === 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) === 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) === 0xF0) $n=3; # 11110bbb
        else continue; # no valid lead byte (5-byte forms such as 111110bb are not valid UTF-8)

        for ($j = ++$n; --$j;) { # n bytes matching 10bbbbbb follow ?
            if ((++$i === $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                continue 2;
            }
        }

        $match = substr($str, $o, $n);

        if ($n === 3 && isset($invalid[$match])) # test invalid sequences
            continue;

        $return .= $match;
    }
    return $return;
}
hakre
  • Thanks, I will try that function and report back. – Jacek Kaniuk Oct 03 '11 at 15:52
  • It's only partly working - valid_utf8_bytes does not change anything, but `preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str)` does not work as expected: �Mario Negri� Institute > Mario Negri Institute (ok), but: Universität Göttingen > Universitat Gottingen (not ok - some regional characters are removed) – Jacek Kaniuk Oct 04 '11 at 11:05
  • I assume the `ä` in Universität is more than one code point: `a` plus the two dots on top (a combining mark). You might want to allow those as well; see [Unicode character properties PCRE PHP](http://www.php.net/manual/en/regexp.reference.unicode.php) for what is used (and could be used) inside the regular expression. It's hard to tell without the original strings, though. – hakre Oct 04 '11 at 11:28
  • I've added \pM and it's OK now: `$text = preg_replace('/[^\pL\pN\pP\pS\pZ\pM]/u', ' ', $text);`. Thanks! – Jacek Kaniuk Oct 04 '11 at 12:51
  • Cool, thanks for the feedback. Keep the order alphabetical so it's easier to maintain ;) – hakre Oct 04 '11 at 12:55
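The effect discussed in these comments can be reproduced in isolation. This is only a sketch: `filter_name` is a hypothetical helper, and the two sample strings are assumptions standing in for the original data, which was not posted.

```php
<?php
// filter_name() is a hypothetical helper, not part of the answer above.
function filter_name($str, $keepMarks = true)
{
    // \pM covers combining marks such as U+0308 (combining diaeresis).
    $classes = $keepMarks ? '\pL\pM\pN\pP\pS\pZ' : '\pL\pN\pP\pS\pZ';
    return preg_replace('/[^' . $classes . ']/u', '', $str);
}

$precomposed = "Universit\xC3\xA4t";   // "ä" as the single code point U+00E4
$decomposed  = "Universita\xCC\x88t";  // "a" followed by U+0308

// Each line prints bool(true).
var_dump(filter_name($precomposed, false) === $precomposed); // a letter, so it is kept
var_dump(filter_name($decomposed, false) === 'Universitat'); // the mark is stripped
var_dump(filter_name($decomposed, true) === $decomposed);    // the mark survives with \pM
```

A precomposed `ä` is a letter and survives either pattern; only the decomposed form needs `\pM`.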
1

The database might not be the whole problem: if the tables are utf8-encoded, the strings in them should have been converted (I think). The issue I've run into is making sure the encoding is consistent everywhere. For instance, the mysqli connector defaults to latin1 (ISO-8859-1), IIRC, so it's quite possible to have the output in utf8 and the database in utf8 and still end up with ? characters because the mysqli connector converts them to Latin-1.

To ensure utf8 across the board you need to do something like:

In the database:

ensure the collation is something like utf8_general_ci

At the top of the PHP view file:

<?php header('Content-Type: text/html; charset=utf-8'); ?>

In the HTML meta tag (optional):

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

AND in the database connector (using MySQLi as an example):

$mysqli->set_charset('utf8'); # $mysqli is your connection object; note that for MySQL it isn't hyphenated

You might find that resolves the problem anyway.

CD001
  • Fair enough - at least that's eliminated this as a possible cause ;) I think hakre's solution above offers you your best bet at the moment; it's a bit hacky but I can't really think of an elegant way of doing it in PHP - there _might_ be something in the MB string library? http://php.net/manual/en/book.mbstring.php – CD001 Oct 03 '11 at 15:56
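One way the mbstring idea might look, as a sketch only. It assumes the mbstring extension is loaded; `strip_invalid_for_display` is a hypothetical name, and the character classes are the ones from hakre's answer plus `\pM`:

```php
<?php
// Hypothetical helper; assumes the mbstring extension is available.
function strip_invalid_for_display($str)
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    // 'none' drops invalid bytes instead of replacing them with "?".
    mb_substitute_character('none');
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    // Keep letters, marks, numbers, punctuation, symbols and separators.
    $filtered = preg_replace('/[^\pL\pM\pN\pP\pS\pZ]/u', '', $clean);

    // preg_replace returns null on error; fall back to an empty string.
    return $filtered === null ? '' : $filtered;
}

echo strip_invalid_for_display("\x93Mario Negri\x94 Institute"), "\n";
```

Converting from UTF-8 to UTF-8 looks redundant, but it makes mbstring re-validate every byte sequence.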
0

If the database is the issue, which it seems to be in your case (and fixing it is out of the question), then maybe just print the value of each character in the string using ord() and find the values of the characters that are not coming through correctly.

Once you know those values, pass them to a function that searches for the offending characters and replaces the flawed encoding with the corresponding valid UTF-8 characters on the fly.
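The two steps above could be sketched like this. Both function names are hypothetical, and the example map (stray Windows-1252 curly quotes `0x93`/`0x94`) is an assumption about what the bad bytes might be, not something confirmed by the question:

```php
<?php
// Step 1: list every byte as hex so the offending values can be spotted.
function hex_dump_bytes($str)
{
    $hex = array();
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $hex[] = sprintf('%02X', ord($str[$i]));
    }
    return implode(' ', $hex);
}

// Step 2: replace stray bytes using a caller-supplied map; drop unknown ones.
function repair_stray_bytes($str, array $map)
{
    // The first four alternatives consume structurally valid UTF-8 sequences
    // untouched; the capture group only matches a byte that fits none of them.
    $pattern = '/[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF4][\x80-\xBF]{3}|(.)/s';

    return preg_replace_callback($pattern, function ($m) use ($map) {
        if (!isset($m[1])) {
            return $m[0]; // valid sequence, keep as is
        }
        return isset($map[$m[1]]) ? $map[$m[1]] : '';
    }, $str);
}

// Assumed map: Windows-1252 curly quotes to their UTF-8 equivalents.
$map = array("\x93" => "\xE2\x80\x9C", "\x94" => "\xE2\x80\x9D");

echo hex_dump_bytes("A\x93"), "\n"; // prints "41 93"
echo repair_stray_bytes("\x93Mario Negri\x94 Institute", $map), "\n";
```

Matching valid sequences first matters: a plain byte-for-byte replacement would also hit `0x93` where it is a legitimate continuation byte, for example in `Ó` (`C3 93`).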

Mathieu Dumoulin
  • The database is built from many sources, some of them already containing items with bad encoding (a different one each time). It would be a massive task to fix it. I need to find some easy workaround. – Jacek Kaniuk Oct 03 '11 at 15:42