
My HTML pages and MySQL tables contain Cyrillic text. To display the Cyrillic text Барысаў2000 I use

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />

on the web page. The MySQL table that stores that word uses the utf8_unicode_ci collation (from what I've read, utf8_unicode_ci is recommended for storing Cyrillic characters). But in phpMyAdmin the text Барысаў2000 appears to be stored as Áàðûñà¢2000, and that's the problem I want to solve. (The user's text is saved to the db via the POST method, with dangerous characters escaped.) However, when I SELECT that data and display it on the HTML page, it looks fine: Барысаў2000.

The way phpMyAdmin displays the text didn't bother me until today, when I tried to fix it.

I guessed that I have to use UTF-8 everywhere, so I switched from

<meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />

to

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

Now my pages display question marks instead of Cyrillic characters, and the problem with how Cyrillic text is displayed in my db is still not solved. Can anyone tell me what the problem is? P.S. I can read Serbian and Belarusian (Cyrillic-script) websites without any problems, and I can type Cyrillic text on my localhost.

Thank you.

Owen Blacker
Haradzieniec
  • You must use UTF-8 everywhere: db table collation and HTML charset. I normally use utf8_general_ci for tables rather than utf8_unicode_ci. To be safe, you can also call mysqli_set_charset($connection, 'utf8'); after creating the connection - http://php.net/manual/en/mysqli.set-charset.php. You've evidently had it wrong from the beginning and the data in your db is not encoded properly, hence it won't show on the page properly. Also make sure that your PHP files are saved as 'UTF-8 without BOM' – AR. Nov 23 '11 at 06:08

1 Answer


The problem with phpMyAdmin is probably caused by incorrect guessing of the character encoding. If you encode the text Барысаў2000 using the windows-1251 charset, you end up with the byte stream C1 E0 F0 FB F1 E0 A2 32 30 30 30 0D 0A (the trailing 0D 0A is a line break). If this byte stream is interpreted as text in the ISO-8859-1 or windows-1252 encoding, the result is shown as Áàðûñà¢2000.
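
The byte-level claim can be checked with a short sketch. Python is used here only for illustration (it is not part of the original setup), and `cp1251`/`cp1252` are Python's codec names for windows-1251 and windows-1252:

```python
# Encode the word as windows-1251, then misread the same bytes as windows-1252.
text = "Барысаў2000"
raw = text.encode("cp1251")

# Hex dump of the encoded bytes (without the trailing line break).
hex_bytes = " ".join(f"{b:02X}" for b in raw)
print(hex_bytes)

# The same bytes decoded with the wrong 8-bit charset.
misread_1252 = raw.decode("cp1252")
misread_latin1 = raw.decode("latin-1")
print(misread_1252 == "Áàðûñà¢2000")
print(misread_1252 == misread_latin1)
```

For this particular word both wrong charsets give the same garbled string, so the symptom alone does not tell you which of the two phpMyAdmin assumed.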

This suggests that the strings in your database are really stored with windows-1251 encoding. Then if you output these strings and only claim that it uses UTF-8 encoding (without doing any recoding), the result will be garbage text because that byte stream contains invalid UTF-8 byte sequences.
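
A minimal sketch of why relabelling alone fails, again in Python purely for illustration: the windows-1251 bytes are not valid UTF-8, so a strict decoder rejects them and a lenient one substitutes placeholder characters. Which placeholder a browser actually shows (question marks, boxes, or U+FFFD) depends on the browser.

```python
# windows-1251 bytes for the word, as they would sit in the page or database.
raw = "Барысаў2000".encode("cp1251")

# Strict decoding: the bytes are not well-formed UTF-8.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)

# Lenient decoding: invalid sequences become U+FFFD, the ASCII digits survive.
lenient = raw.decode("utf-8", errors="replace")
print(lenient.endswith("2000"))
```

This matches the symptom of English text and numbers looking fine while the Cyrillic part turns into placeholders.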

You should either continue serving your pages with the windows-1251 charset and tell phpMyAdmin to use this charset too, or switch to Unicode everywhere (also internally, in the database). The fewer character conversions and encoding guesses you need, the easier your system will be to maintain.
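
If you go the Unicode route, the existing data has to be recoded, not just relabelled. The following is only a sketch of the idea in Python; the function names `recode_cp1251_to_utf8` and `repair_mojibake` are hypothetical helpers, and it assumes the stored bytes really are windows-1251 and that the garbled text was produced by a windows-1252 misreading:

```python
def recode_cp1251_to_utf8(raw: bytes) -> bytes:
    """Reinterpret windows-1251 bytes and re-encode them as UTF-8."""
    return raw.decode("cp1251").encode("utf-8")


def repair_mojibake(garbled: str) -> str:
    """Undo a windows-1252 misreading of windows-1251 bytes.

    Assumption: every character in `garbled` exists in windows-1252.
    Bytes that windows-1252 leaves undefined (for example 0x90, which is
    a Serbian letter in windows-1251) will not round-trip this way.
    """
    return garbled.encode("cp1252").decode("cp1251")


stored = b"\xc1\xe0\xf0\xfb\xf1\xe0\xa2" + b"2000"
print(recode_cp1251_to_utf8(stored).decode("utf-8") == "Барысаў2000")
print(repair_mojibake("Áàðûñà¢2000") == "Барысаў2000")
```

Back up the table before trying anything like this on real data, since a wrong guess about the source encoding makes the damage harder to undo.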

jasso
  • Thank you for your thoughts. I'm still thinking about the problem and can't solve it. The mystery for me is this: even if I set UTF-8 in the meta tag on the HTML page, even if every text field in the db uses utf8_unicode_ci, even if I DO NOT DISPLAY TEXT FROM THE DB ON THE PAGE BUT PUT SOME CYRILLIC TEXT DIRECTLY ON THE PAGE (so I rule out a possible wrong interpretation by phpMyAdmin), and even with the Cyrillic text removed from the db, English text and numbers are OK on the HTML page, but the Cyrillic text is not even Áàðûñà¢2000 but ???????2000. Yet ANY Cyrillic website looks fine in FF or IE. I've spent 2 days wondering why. – Haradzieniec Nov 24 '11 at 19:01
  • @Haradzieniec Try switching the character encoding in your browser and tell me which charset seems to work. Then you'll know what encoding your HTML file *really* uses. Note that the `<meta>` element may be ignored if the encoding is defined in the HTTP headers. Also, if you generate the HTML page from several input sources, the result may end up using more than one encoding (which of course is a problem). – jasso Nov 24 '11 at 20:15
  • This is very strange... I started simplifying the code... And now... the same 9-character Cyrillic word is 17 bytes in one file (that one shows as rectangles on the HTML page) and 7 bytes in another file. Both files display 9 Cyrillic characters and look the same in Notepad. Notepad displays the same single word, no spaces, no new lines. Is that possible? As I understand it, I can't attach files on the forum for security reasons... P.S. Encoding autodetection says it's Windows (Cyrillic), and it displays OK only when set to Windows (Cyrillic), not UTF-8. What the hell... – Haradzieniec Nov 26 '11 at 07:07
  • @Haradzieniec Notepad can handle and autodetect Unicode (UTF-8 and UTF-16) and some 8-bit encodings, so the 17 bytes word probably uses some Unicode encoding and the shorter uses some 8-bit encoding. Notepad recognizes the encoding and displays the text correctly, although handling Unicode files with Notepad is inconvenient. If your browser shows the text correctly only when it is set to use "Windows (Cyrillic)" encoding, it proves that you really don't serve the files as Unicode to the browser but use some 8-bit encoding instead. – jasso Nov 26 '11 at 11:59
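
The file-size difference discussed in the comments above can be reproduced with a small sketch. This is Python for illustration only, and `guess_encoding_family` is a hypothetical helper, not a reliable detector: it only separates "has a Unicode BOM", "is well-formed UTF-8", and "something else, probably an 8-bit charset".

```python
import codecs

word = "Барысаў2000"

# Byte length of the same word under different encodings.
# Python's "utf-8-sig" and "utf-16" codecs prepend a byte order mark.
sizes = {enc: len(word.encode(enc)) for enc in ("cp1251", "utf-8", "utf-8-sig", "utf-16")}
print(sizes)


def guess_encoding_family(raw: bytes) -> str:
    """Rough classification of a byte string; pure ASCII reports as utf-8."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit encoding"


print(guess_encoding_family(word.encode("cp1251")))
print(guess_encoding_family(word.encode("utf-8")))
```

Two files that look identical in an editor but differ in size are therefore a strong hint that they were saved in different encodings.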