1

I want to migrate my project from MySQL to postgres, I have one table in MySQL, in which utf8mb4 set for particular column in a table, what alternative is there in postgres to set in column for encoding?

Lajos Arpad
  • 64,414
  • 37
  • 100
  • 175
neetu
  • 11
  • 3
  • 1
    Well, Postgres uses UTF-8 encoding by default. Not sure what `utf8mb4` really does, but a proper UTF-8 encoding should include that –  Sep 17 '22 at 10:08
  • 1
    @a_horse_with_no_name `utf8mb4` is MySQL's way to represent 4-byte UTF8 encoding. – Lajos Arpad Sep 17 '22 at 10:30
  • @LajosArpad: but that's how UTF-8 is defined: encode characters in 1 to 4 bytes. Or is that a fixed width encoding? So the correct name would be UTF-32 then? –  Sep 17 '22 at 11:26
  • @a_horse_with_no_name `utf8mb3` for instance stores up to 3 bytes... – Lajos Arpad Sep 17 '22 at 15:53
  • @LajosArpad: so that means `utf8mb4` is essentially a standard UTF-8 encoding (whereas `utf8mb3` would be non-standard) –  Sep 17 '22 at 16:55
  • @a_horse_with_no_name yes. It is worth pointing out that in MySQL, the default encoding is `latin1` and if you didn't specify the encoding to be utf8mb4 and you want to have a UTF encoding, that's bad luck, you will need to convert then all the columns of all the tables and test them. – Lajos Arpad Sep 17 '22 at 16:58
  • @a_horse_with_no_name - UTF-16 (usc2) and UTF-32 go off in a different direction; they are not relevant here. UTF-8 and MySQL's utf8mb4 each encode any character (specific _so far_) in 1-4 "utf8 bytes". UTF-32 has _every_ character taking 4 bytes (ar rather awful way to go). – Rick James Sep 17 '22 at 17:48
  • 1
    Furthermore, MySQL's utf8mb3 and utf8mb4 are, technically, subsets of UTF-8. UTF-8 _can_ have 5-bytes characters, but none are defined yet. MySQL can't handle such, yet. (I don't know about Postgres.) – Rick James Sep 17 '22 at 17:50

1 Answers1

1

utf8mb4 is MySQL's way to represent 4-byte UTF characters, however, as the documentation clarifies:

Requires a maximum of four bytes per multibyte character.

So, actually not all characters are stored in four bytes. The OS is also not using up all the 4 available bytes for each characters, so you should be able to migrate your utf8mb4 characters into a UTF-8 encoded target field (MySQL - PostgreSQL) without problems, at least in theory.

But you never know whether this fits practice, so it is advisable to first create a backup of your MySQL database (so you will not be afraid of doing changes to it if for some reason you decide that the initial database needs some changes), export your database and modify your table's/column's definition to no longer use utf8mb4 as an encoding, but rather leave it unspecified (if you can rely on the fact that PostgreSQL has UTF-8 as the default encoding) or specify a UTF-8 encoding explicitly and run the inserts. Take a few samples of data from the original database and compare them to what PostgreSQL returns to them. If it works out of the box, then the theory was fitting the practice. If not, then you will need to research for the cause of the problem you experience.

Lajos Arpad
  • 64,414
  • 37
  • 100
  • 175