Can someone help me understand why I should or should not convert all my columns from latin1_swedish_ci to utf8_unicode_ci?

Question

I know this has been asked many times, but I'm having a hard time understanding the difference in collation and encoding.

OK, so we have a MySQL database (several actually) that have tables with mixed collation. Now, some columns are utf8_unicode_ci and some are the default of latin1_swedish_ci. We have since used the utf8_unicode_ci as the default. But we have a lot of "legacy" columns in the latin1.

We like consistency and would like to convert all columns to the utf8. Fortunately, 99.9% of our data is in English with the (literally) couple of records that have Spanish addresses (tildes). So pretty simple stuff for us but we want to do it the "right way" and fully support any language. Hence, the utf8_unicode.

We would rather not have to go through every column in every table and make that change. Is there a simple script that would covert every column to utf8_unicode_ci?

Also, what are some pitfalls we need to look for? I mean, if I'm wrong and there are records in Russian or Japanese that I'm not aware of then will I have problems?

And maybe this should be a side question, but for encoding, UTF8 is better because it can store any language, correct? As in, each character would have 2 or 4 bytes that represent what it should be in a "global" alphabet of thousands of characters? Sorry if that analogy sounds stupid but trying to understand. And collation is just how they are sorted when I query. So that if the characters were stored like A8 9F for the letter H and B3 2E for the letter e then H would come first because A8 comes before B3?

Ha, my head is spinning now. Sorry if this seems to be three different questions but when I go to management and say I need to convert columns, this is why....

Thanks

Standardization and Compatibility in future SQL Environments? Also, It will allow for different special characters to be used. So, depends what type of language (and with what special characters) you want to accommodate for. — TheGeekZn, Nov 15 '12 at 13:48
Right. That is actually my argument for converting everything to `utf8_unicode_ci`. Our apps are small but I would be embarrassed if we couldn't store Spanish or Russian correctly. — cbmeeks, Nov 15 '12 at 13:54
So converting to a collation that will not be able to store special characters wouldn't be beneficial. I would suggest converting to one that will accommodate it all. That being said, do you *really* have to convert, it you are able to use both without errors? Lastly, knowing this, are you still asking the same question? — TheGeekZn, Nov 15 '12 at 13:57
Well, that's what I'm trying to understand. I don't want to convert if I don't have to. I guess the root question is, to utf or not to utf. That is the question. :-) BUT, if I do convert, what is a good script to do so that doesn't involve a lot of manual conversions, etc. — cbmeeks, Nov 15 '12 at 14:00

score 1 · Answer 1 · answered Dec 06 '12 at 16:16

Simplest way that comes to mind is:

create a backup database create database olddb like dbname;
export all data and stucture to a dump: mysqldump dbname > dbname.sql
import everything into your backup: mysql -D olddb < dbname.sql
only export table structure: mysqldump --no-data olddb > structure.sql
open structure.sql with a text editor of your choice and replace all latin1_swedish_ci with utf8_unicode_ci
drop your database: drop database dbname to clear all tables easy.
recreate your database with utf8: create schema dbname default character set utf8 collate utf8_swedish_ci
import your altered structure into new database: mysql -D dbname < structure.sql. At this point your should have your entire old database in olddb, and an empty, fully utf8 database in dbname.
for each table in your database use a insert into dbname.tbl select * from olddb.tbl. if there are a lot of tables, you could generate a sql script to do this by exporting output of show tables from olddb and adding insert sqls around it.
Check the results, if you're satisfied, drop database olddb

This procedure could be a bit shorter, unfortunately MySQL no longer supports database renaming.

Can someone help me understand why I should or should not convert all my columns from latin1_swedish_ci to utf8_unicode_ci?

1 Answers1