0

For some reason my special characters got encoded as the following string in a mysql database:

Ã?

Which shows up as:

Ã?

But actually should show up as:

Ö
  1. What went wrong here? I use UTF-8 everywhere.

  2. How can I fix this without recreating all content?

clamp
  • 33,000
  • 75
  • 203
  • 299
  • Are you using PDO? Maybe take a look at this -> PDO::MYSQL_ATTR_INIT_COMMAND http://stackoverflow.com/questions/10209777/php-pdo-with-special-characters – Bart Scheffer Nov 01 '15 at 21:00
  • if the bad encoding is consistent, you could use php or MySQL for find and replace –  Nov 01 '15 at 21:01
  • @BartScheffer i am not using PDO, but i have set `$mysqlconn->set_charset("utf8");` – clamp Nov 01 '15 at 21:04
  • 1
    How did the characters get into those entities? That character representation is correct for the entity. – chris85 Nov 01 '15 at 21:06
  • Ah ok. I'm not really familiar with other sorts of connections since I've always been using PDO. But I remember another question where 'htmlspecialchars' was the solution. http://php.net/manual/en/function.htmlspecialchars.php Good Luck! – Bart Scheffer Nov 01 '15 at 21:06
  • 1
    Ã? should really show up as Ã?. I think the problem lies with what was inserted – Mateo Barahona Nov 01 '15 at 21:09
  • 1
    @Bart Scheffer I think you mean [html_entity_decode](http://php.net/manual/en/function.html-entity-decode.php) – Hasse Björk Nov 01 '15 at 21:10
  • Also if you use utf8 everywhere why convert to entities? That is usually for non-utf8 dbs, entities are ascii. – chris85 Nov 01 '15 at 21:12
  • Ã **is** Ã and Ö should be Ö when converting html to UTF-8 – Hasse Björk Nov 01 '15 at 21:12
  • 2
    My guess is that things went wrong with the Ö even before the values were inserted into the database. The utf-8 sequence for Ö has a first byte that in ANSI is Ã, then if after wrongly interpreting that as ANSI, this is escaped for HTML, you get what your read. Does not seem the type of manipulation a database does on its own. It is in the input procedure that things went wrong. Can you show the code of the insert? – trincot Nov 01 '15 at 21:17
  • @trincot the insert code is `$event = htmlentities(strip_tags($_POST["event"]));` – clamp Nov 01 '15 at 21:23
  • 1
    Maybe you should have used `$event = html_entity_decode(strip_tags($_POST["event"]));` – Hasse Björk Nov 01 '15 at 21:29
  • Could you test and post this "Ö" via your form, and echo what `$event` is like on that call? Some debugging is needed here.. – trincot Nov 01 '15 at 21:48
  • What version of PHP are you running? – trincot Nov 01 '15 at 22:01
  • @trincot: php version is 5.3.29 but i can easily change it. – clamp Nov 02 '15 at 21:09
  • ok, I have included that fact in my answer below. it affects this behaviour. – trincot Nov 02 '15 at 21:16

2 Answers2

2

I executed the following in PHP:

<?php
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "ISO-8859-1")) , '<br />';    
echo str_replace("&", "&amp;", htmlentities("Ö", 0, "UTF-8")), "</br>";
?>

The str_replace is just there to reveal any HTML mnemonics, which would otherwise be translated by the browser to the original character, which I don't want to happen.

You will get this as output:

&Atilde;�
&Ouml;

You'll recognise the first value as what you found in the database, and the second one is a bit like you wanted it to be. Add to this the fact that the default value for the third argument to htmlentities depends on your PHP version and is ISO-9959-1 in the case of version 5.3, the one you use. Also realise that HTML documents which do not specify a character encoding will by default post form data in ISO-8859-1 format. Combining all this might give a clue about the cause of your problem:

My guess is that the data is correctly posted as UTF-8 to the server, but then htmlentities interprets this as a non-UTF-8, single byte encoding, and so turns one, multi-byte character into two single byte characters.

Now to the measures to take that this does not continue to happen:

First make sure that your HTML form has the UTF-8 encoding, because this determines the default encoding that a form will use for sending its data to the server:

<head>
    <meta charset="UTF-8">
</head> 

Make sure this is not overruled by another encoding in the form tag's accept-charset attribute.

Then, skip the htmlentities call. You should not turn characters into their HTML mnemonic when storing them in the database. MySql supports UTF-8 characters, so just store them like that.

For the second question, you'll have to find all cases and bulk replace them as you find new instances. You could get get a little help by producing some SQL statements with a PHP script like the following:

<?php
    // list all your non-ASCII characters here. Do not use str_split.
    $chars = ["Ö","õ","Ũ","ũ"];
    foreach ($chars as $ch) {
        $bad = str_replace("&", "&amp;", htmlentities($ch, 0, "ISO-8859-1"));
        echo "update mytable set myfield = replace(myfield, '$bad', '$ch') 
                where instr(myfield, '$bad') > 0;<br />";
    }
?>

The output of this script will look like this:

update mytable set myfield = replace(myfield, '&Atilde;�', 'Ö') where instr(myfield, '&Atilde;�') > 0;
update mytable set myfield = replace(myfield, '&Atilde;&micro;', 'õ') where instr(myfield, '&Atilde;&micro;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&uml;', 'Ũ') where instr(myfield, '&Aring;&uml;') > 0;
update mytable set myfield = replace(myfield, '&Aring;&copy;', 'ũ') where instr(myfield, '&Aring;&copy;') > 0;

Of course, you could decide to make a PHP script that will even do the updates itself.

Hopefully you can use this information to fix the issues.

trincot
  • 317,000
  • 35
  • 244
  • 286
  • thanks for your detailed answer! the first part is correct! it was caused by different php versions on the dev-server and the live-server. about the second part, i might have to come back to you tomorrow, since today i cant test this script right now. – clamp Nov 02 '15 at 22:03
0

For PDO, use something like

$db = new PDO('dblib:host=host;dbname=db;charset=UTF-8', $user, $pwd);

&Atilde;? is two or three things going wrong, not just one! C396 is the utf8 hex for Ö or the latin1 hex for the two characters Ö. It requires something else to go wrong to get ? or the black diamond.

Let's see what is in the table; do

SELECT col, HEX(col) FROM tbl WHERE ...

(If you have already done the previously suggested replace(), then the table may be in an even worse mess. Or it might be fixed.)

Rick James
  • 135,179
  • 13
  • 127
  • 222