0

So I have a weird truncate issue! Can't find a specific answer on this.

So basically there's an issue with an apparent ISO character ½ that truncates the rest of the text upon insertion into a column with UTF-8 specified.

Lets say that my string is: "You need to add ½ cup of water." MySQL will truncate that to "You need to add"

if I:

print iconv("ISO-8859-1", "UTF-8//IGNORE", $text);

Then it outputs:

½

O_o

OK that doesn't work because I need the 1/2 by itself. If I go to phpMyAdmin and copy and paste the sentence in and submit it, it works like a charm as the whole string is in there with half symbol and remaining text! Something is wrong and I'm puzzled at what it is. I know this will probably affect other characters so the underlying problem needs to be addressed.

The language I'm using is php, the file itself is encoded as UTF-8 and the data I'm bringing in has content-type set to ISO-8859-1. The column is utf8_general_ci and all the mysql character sets are set to UTF-8 in php: "SET character_set_result = 'utf8', etc..."

Paul Carlton
  • 2,785
  • 2
  • 24
  • 42
  • The text probably gets messed up somewhere between receiving it and sending it to the database, you should post more of your code. Also, are you setting the connection encoding to something using the `mysql_set_charset` function or equivalent? – Joni Aug 09 '13 at 09:35
  • Ok I will post more code but the iconv test I'm preparing is not being inputted in the database. I'm actually just printing it out. I will specify this in the quesiton. – Paul Carlton Aug 11 '13 at 17:37

3 Answers3

4

Something in your code isn't handling the string as UTF8. It could be your PHP/HTML, it could be in your connection to the DB, or it could be the DB itself - everything has to be set as UTF8 consistently, and if anything isn't, the string will get truncated exactly as you see when passing across a UTF8/non-UTF8 boundary.

I will assume your DB is UTF8 compliant - that is easiest to check. Note that the collation can be set at the server level, database level, the table level, and the column level within the table. Setting UTF8 collation on the column should override anything else for storage, but the others will still kick in when talking to the DB if they're not also UTF8. If you're not sure, explicitly set the connection to UTF8 after you open it:

$dbh->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, "SET NAMES 'utf8'");

Now your DB & connection are UTF8, make sure your web page is too. Again, this can be set in more than one place (.htaccess, php.ini). If you're not sure / don't have access, just override whatever PHP is picking up as default at the top of your page:

<?php ini_set('default_charset', 'UTF-8'); ?>

Note that you want the above right at the start, before any text is output from your page. Once text gets output, it is potentially too late to try and specify an encoding - you may already be locked into whatever is default on your server. I also then repeat this in my headers (possibly overkill):

<head>
<meta charset="UTF-8">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
</head>

And I override it on forms where I'm taking data as well:

<FORM NAME="utf8-test" METHOD="POST" ACTION="utf8-test.php" enctype="multipart/form-data" accept-charset="UTF-8">"

To be honest, if you've set the encoding at the top, my understanding is that the other overrides aren't required - but I keep them anyway, because it doesn't break anything either, and I'd rather just state the encoding explicitly, than let the server make assumptions.

Finally, you mentioned that in phpMyAdmin you inserted the string and it looked as expected - are you sure though that the phpMyAdmin pages are UTF8? I don't think they are. When I store UTF8 data from my PHP code, it views like raw 8-bit characters in phpMyAdmin. If I take the same string and store it directly in phpMyAdmin, it looks 'correct'. So I'm guessing phpMyAdmin is using the default character set of my local server, not necessarily UTF8.

For example, the following string stored from my web page:

I can¹t wait

Reads like this in my phpMyAdmin:

I can’t wait

So be careful when testing that way, as you don't really know what encoding phpMyAdmin is using for display or DB connection.

If you're still having issues, try my code below. First I create a table to store the text in UTF8:

CREATE TABLE IF NOT EXISTS `utf8_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `my_text` varchar(8000) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

And here's some PHP to test it. It basically takes your input on a form, echoes that input back at you, and stores/retrieves the text from the DB. Like I said, if you view the data directly in phpMyAdmin, you might find it doesn't look right there, but through the page below it should always appear as expected, due to the page & db connection both being locked to UTF8.

<?php
  // Override whatever is set in php.ini
  ini_set('default_charset', 'UTF-8');

  // The following should not be required with the above override
  //header('Content-Type:text/html; charset=UTF-8');

  // Open the database
  $dbh = new PDO('mysql:dbname=utf8db;host=127.0.0.1;charset=utf8', 'root', 'password');

  // Set the connection to UTF8
  $dbh->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, "SET NAMES 'utf8'");
  // Tell MySql to do the parameter replacement, not PDO
  $dbh->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);
  // Throw exceptions (and break the code) if a query is bad
  $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

  $id = 0;
  if (isset($_POST["StoreText"]))
  {
    $stmt = $dbh->prepare('INSERT INTO utf8_test (my_text) VALUES (:my_text)');
    $stmt->execute(array(':my_text' => $_POST['my_text']));
    $id = $dbh->lastInsertId();
  }
?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional/EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">

<title>UTF-8 Test</title>
</head>

<body>

<?php
  // If something was posted, output it
  if (isset($_POST['my_text']))
  {
    echo "POSTED<br>\n";
    echo $_POST['my_text'] . "<br>\n";
  }

  // If something was written to the database, read it back, and output it
  if ($id > 0)
  {
    $stmt = $dbh->prepare('SELECT my_text FROM utf8_test WHERE id = :id');
    $stmt->execute(array(':id' => $id));
    if ($result = $stmt->fetch())
    {
      echo "STORED<br>\n";
      echo $result['my_text'] . "<br>\n";
    }
  }

  // Create a form to take some user input
  echo "<FORM NAME=\"utf8-test\" METHOD=\"POST\" ACTION=\"utf8-test.php\" enctype=\"multipart/form-data\" accept-charset=\"UTF-8\">";

  echo "<br>";

  echo "<textarea name=\"my_text\" rows=\"20\" cols=\"90\">";

  // If something was posted, include it on the form
  if (isset($_POST['my_text']))
  {
    echo $_POST['my_text'];
  }

  echo "</textarea>";

  echo "<br>";
  echo "<INPUT TYPE = \"Submit\" Name = \"StoreText\" VALUE=\"Store It\" />";

  echo "</FORM>";
?>
<br>

</body>

</html>
Peter Barton
  • 577
  • 1
  • 4
  • 12
  • I made a work around because most of the time it's because of some strange funny characters like a longer dash or a different tick, what this would cause is those funny characters. I have the database collation to utf-8 and setting charset to utf-8, all that was there which is why I was scratching my head. Something is up but I didn't have enough time to nail down the specifics. What I basically did to fix this was to use a very nice regex pattern while preserving newlines and making sure some punctuation stayed: preg_replace("@[[:^print:]]@", '', $output) works nicely. – Paul Carlton Nov 13 '14 at 17:36
  • Besides, I'm not dealing with foreign languages and I think basically what is happening is people are using word doc and then just uploading that with those other kinds of characters so, really forcing the ascii table would be fine but you miss out on characters and there would be this strange trailing bit... you have no idea how weird it was. – Paul Carlton Nov 13 '14 at 17:37
  • UTF-8 will represent anything, not just foreign languages. My text is English, but like you say it's the special quotes, etc that cause issues. If you lock everything to UTF-8, you'll handle any text - I suspect something in your web page wasn't forcing this, and the string was being handled as non-UTF-8 somewhere. You just have to be sure that anywhere the text goes is UTF-8 also - some of my text goes to email, you have to set the encoding there too, etc. If you can't guarantee that, you can strip the chars or do conversions - up to you how accurately you want to preserve the original text. – Peter Barton Nov 13 '14 at 23:51
0

Check into mb_convert_encoding if you can't change the way the data is handled. Otherwise, do yourself a favor and get your encoding on the same page before it gets out of hand. UTF-8 uses multibyte characters which aren't recognized in the ISO-8859-1 (Latin) encoding. wikipedia. This page and this page are good sources, as well as this debug table.

Finally, I've run into this when various combinations of htmlentities, htmlspecialchars and html_entity_decode are used..

Curtis Mattoon
  • 4,642
  • 2
  • 27
  • 34
  • I have checked mb_convert_encoding. The test I ran on it was converting from ISO to UTF8, but it does still give me that funny A. I'm missing something, probably something ridiculously obvious which is often the case with these sorts of problems for me. – Paul Carlton Aug 11 '13 at 17:34
0

Did you call set_charset() on your MySQLi database connection? It's required to properly use real_escape_string().

$db = new mysqli(...);
$db->set_charset('utf8');

Setting session variables in your connection is not enough -- those affect what happens on the server-side. The set_charset will affect what happens client side.

You can checkout the PHP reference mysqli::real_escape_string

greezybacon
  • 556
  • 5
  • 15