0

I have written data crawling engine which does (so far) everything right except it completely stops data inserting process at the point where euro € symbol ocurres in the text. I belive this symbol may not be the only one.

If string that is about to be inserted into database table looked like this:

Quick brown € fox jumped over the laizy dog.

this would end up in database:

Quick brown 

My trouble is that I've followed several instructions found on this site about precisely this issue, with no luck.

What I have already done to fix this(php-side):

made sure that:

mysqli_set_charset($dblink, "utf8");
mysqli_real_escape_string($dblink, $string);

tried with:

iconv('windows-1250','utf8//TRANSLIT', $string );
iconv('windows-1250','utf8', $string );/*crawled web page declared windows-1250 characterset*/

(here I had success finishing data insertion with € symbol preserved, however other latin characters čćšđž was completely degraded)

  • several others attempts with mb_convert_encoding()
  • attempts to change server response with of Accept and Accept-Charset headers

mysql side:

Table and fields was set to utf8_unicode_ci

tried to fix DEFAULT CHARACTER SET to utf8 (did that to existing, non-empty database)

No success. Whatever I attempt to, insertion process stops with € symbol.

Does anyone know what should I do?

Miloš Đakonović
  • 3,751
  • 5
  • 35
  • 55
  • Are you using a prepared statement for the insert? – Barmar Feb 28 '13 at 21:05
  • Where is your data coming from? – Pekka Feb 28 '13 at 21:06
  • Does changing to prepared statements solve the problem? – Barmar Feb 28 '13 at 21:11
  • Ahm... little to no experience with prepared statements... I could give a shot. – Miloš Đakonović Feb 28 '13 at 21:12
  • Prepared statements aren't going to change a thing. The big question here is where you get the data from, and what encoding it is in. – Pekka Feb 28 '13 at 21:13
  • @Pekka웃, file_get_contents() with created context of http headers. – Miloš Đakonović Feb 28 '13 at 21:13
  • What encoding is the site you're parsing? – Pekka Feb 28 '13 at 21:14
  • windows-1250 declared in – Miloš Đakonović Feb 28 '13 at 21:14
  • @Miloshio if the browser's auto sniffing mechanism agrees with that declaration (visit the page and check out the "encoding" menu), your iconv() method should work. If you do the `iconv()`, are the broken characters `čćšđž` on the same page? How do they degrade exactly, can you show an example? – Pekka Feb 28 '13 at 21:15
  • Http headers does not gives characterset info, source is 'declaring' windows-1250, and without touching anything everything is perfect... except that with euro character. I do not belive that if I paste those characters in this site they would stay preserved – Miloš Đakonović Feb 28 '13 at 21:23
  • How exactly do those characters break when the Euro is preserved? – Pekka Feb 28 '13 at 21:23
  • They turn into some like on this image http://www.asciitable.com/index/extend.gif – Miloš Đakonović Feb 28 '13 at 21:30
  • Hmm, that looks as if either the Euro symbol is not in the declared encoding, or the other characters. Not sure what is going on here... it might be worth looking at the web site's data using a hex editor and seeing which character code the Euro symbol has – Pekka Feb 28 '13 at 21:32
  • @Pekka웃 , I've tried to output that, problem does not seems to be in php string itself, since I can echo full string in that very case when only a part ends up in mysql database. – Miloš Đakonović Feb 28 '13 at 21:33
  • When you echo the full string on your PHP page, what encoding does the browser autodetect? – Pekka Feb 28 '13 at 21:36
  • Browser chooses Unicode (UTF-8) – Miloš Đakonović Feb 28 '13 at 21:40
  • @Pekka웃 I made workaround with `iconv('windows-1250', 'UTF-8//TRANSLIT', $string);` then `str_replace('€', '', $string);` and finaly reverse iconv'ing `iconv('UTF-8', 'windows-1250//TRANSLIT', $string)`. That works, since I do not feel sorry about retaining euro symbol. Now my consideration is: **is the euro symbol only one I should sanitise this way?** Does this new situation discovers something to you? And, please, write answer, so I could reward your efforts. – Miloš Đakonović Mar 01 '13 at 11:24
  • I'm not sure what is wrong here - it could be that it's the web site that is giving you the wrong Euro sign, I'm not sure. (Don't worry about the answer, if there is one that works for you, feel free to accept that.) – Pekka Mar 01 '13 at 11:26

2 Answers2

1

I've tested these on my php/mysql install, and both work. Note, I couldn't get euro to work on my command line since my character encoding didn't support it, so I selected it out of a DB first which worked perfectly.

$original = 'Quick brown '.$euro.' fox jumped over the laizy dog.';
$escaped =  str_replace($euro,'\\'.$euro,$original);
echo $escaped;
$database->insertSQL('insert into test (text) values (\''.$original.'\')');
$database->insertSQL('insert into test (text) values (\''.$escaped.'\')');

Both the original and escaped insert correctly into my DB, and the escaped doesn't have the backslash since SQL knows how to handle it.

The only thing I can think might be incorrect with your DB, is the encoding. My encoding is latin1_swedish_ci. If you change the encoding on your table, does my code block above help?

uNople
  • 554
  • 5
  • 10
  • Well, you gave me nice idea of turning problematic character in some non-problematic custom representation, I'll see if it could fix me. What remains is trouble that I do not know if this Euro sign is only one that is problematic, and I have bad feeling that it is not... – Miloš Đakonović Feb 28 '13 at 21:58
  • Yeah, I found the same thing with certain characters that weren't escaped by mysql_real_escape_string. My workaround was the same as above - escape them with backslash so that then insert. – uNople Feb 28 '13 at 22:34
-1

Those characters are multibyte. So when you get it into var đ is somewhat like %71%41 or something like that. That is a place where your problem occurs.

Let me know if you need more detailed help Miloše :)