0

Since Mysql>=8.0 has supported collation in UTF8MB4.

But If the Column of type CHAR is an alphanumeric string, would it be better to custom collation in UTF8 or latin1 ?


I use Flask-Sqlalchemy, and my project sets SQLALCHEMY_DATABASE_URI = 'mysql+mysqldb://root:@localhost:3306/testdb?charset=utf8mb4'

But After upgrade mysql to 8.0 , all tables are created with collation of UTF8MB4 .

eg:

class Topic(db.Model, CoModel):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    name = db.Column(db.String(168))
    content = db.Column(db.Text)

==> mysql

CREATE TABLE `topic` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(168) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `content` text COLLATE utf8mb4_general_ci ,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

Should I custom name = db.Column(db.String(168, collation="utf8_general_ci"))


UTF8MB4 is good at support huge number of character encoding.

Should I configure utf8mb4 by default and use it everywhere ?

NicoNing
  • 3,076
  • 12
  • 23
  • 1
    If you use a lot, and I mean **a lot** of non-ASCII Latin-1 characters (i.e. "ä", "ü" etc.), but don't need any non-Latin1 characters, then you get some storage savings by using Latin1 vs. UTF-8. But that needs to be a pretty substantial amount to make any practical difference. – deceze Jun 14 '19 at 07:22
  • You are right. Now I need to alter table collation from utf8mb4 to utf8, then it raise a new question to me: https://stackoverflow.com/questions/58514911/why-tables-index-storage-size-is-bigger-after-change-charset-from-utf8mb4-to-ut – NicoNing Oct 23 '19 at 06:28
  • 1
    @NicoNing - I puked all over the accepted answer to that other Question. Bottom line for this Question: do _not_ change to utf8. – Rick James Oct 23 '19 at 23:58

1 Answers1

1

Going forward, you should use utf8mb4 for almost all CHAR/VARCHAR/TEXT columns.

CHARACTER SET utf8mb4 covers essentially all the world's character sets. If your client encodes characters as UTF-8 (the outside equivalent of utf8mb4), then utf8mb4 is good.

Moving from utf8 to utf8mb4 is good. The former is a subset of the latter. The difference is Emoji and some Chinese.

Mixing latin1 with utf8 or utf8mb4, is possible, but this forum is full of programmers/dbas who screw it up.

8.0 changed the default to utf8mb4 for a lot of good reasons.

Note MySQL's convention of xxxx_yyy_ci being a collation that applies to the character set xxxx. That is utf8_general_ci belongs with utf8, not utf8mb4.

A "character set" is an encoding. A collation is a set of rules for comparing strings. Example: Should 'A' be treated as equal to 'a'.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Your answers are right for most situation. But if data size is very huge and comfirmed only alphanumeric string , utf-8 or lantin1 is better. https://stackoverflow.com/questions/58514911/why-tables-index-storage-size-is-bigger-after-change-charset-from-utf8mb4-to-utf8 – NicoNing Oct 23 '19 at 06:35
  • @NicoNing - deceze makes a valid argument for latin1 (with caveats). But that does _not_ apply to utf8. All utf8 characters are encoded identically in utf8mb4, hence the same space requirement. – Rick James Oct 24 '19 at 00:01