0

The database for an application I manage use UUIDs. These UUIDs are stored as char(36) with a utf8 character set. From a strict performance point of view, I know this is not optimal. The general recommendation seem to be to use a binary(16) column for UUIDs. I cannot change the data type, but I can change the character set.

The characters in a UUID can be a digit 0 through 9, or letter a through f.

By changing the character set from utf8 to ascii, the total size of all indexes for the database will probably be reduced by several gigabytes.

The application connects to the database and explicitly sets character encoding and connection collation in the connection string : characterEncoding=utf8&connectionCollation=utf8.

What will I have to do (if anything at all) to ensure a safe "conversion" from utf8 to ascii for the UUIDs?

sbrattla
  • 5,274
  • 3
  • 39
  • 63
  • Just of note UTF8 will eventually be replaced with utf8mb4. ( This character set is deprecated in MySQL 8.0, and you should use utfmb4 instead.) – easleyfixed Nov 09 '22 at 19:54
  • One problem I could image is special characters .. it MIGHT work .. but its also possible that a character type won't pass properly as the previous encoding forced it to work. So if you did this, you might have to write some back end code to format the data to error correct the differences between the character handling (IF you experience any .. I suspect with Gigs of data you do ..) – easleyfixed Nov 09 '22 at 19:59
  • @easleyfixed a UUID has no special characters, only characters a through f and 0 through 9. – sbrattla Nov 09 '22 at 20:03
  • That was what I was thinking but wanted to make sure. So you are trying to go from HEX to ACSII ? While I think you could do that, the problem is I'm not sure the ASCII values would translate properly. Although if you are talking about saving gigs of data doing this and it still works fine, you might of found a giant performance booster. MAYBE you could parse each row through a converter API and do a few rows and test the table and see what happens. – easleyfixed Nov 09 '22 at 20:31

2 Answers2

2

Firstly, you might not get the storage space savings you are hoping for converting it to ASCII. The characters in UUIDs are all one byte in both ASCII and UTF8. I don't know the internals of mysql to say how much space it takes up. I suspect it may be like a varchar and have a few bytes to indicate the length of the string in bytes in which case you may only be saving a few bytes on each record.

If you insist on trying it, it is highly recommended that a backup of your database is taken before attempting such an endeavor. Also, it would be prudent to build a small test database to try out your changes. It is prudent to make sure that the applications that use the database still work after the change. You can change the character set and collation of a table's column using a simple SQL statement. You probably should do the same to all similar columns in the database. Changing it at the table or database level could be disasterous for your data.

You probably should also change the application so it initialises the database in the new way at setup or after a database restore.

In summary, it sounds like a high risk, low reward change to me. If there's little else on the database other than the UUIDs and you switch to binary(16) instead of ASCII, then you could get a meaningful space saving, but that would require application changes as well.

Simon Goater
  • 759
  • 1
  • 1
  • 7
1

Standard UUIDs (as opposed to home-grown ones) use only hex, which is a subset of ascii characters which is a subset of utf8. The encoding (how the bits are arranged) is identical. Hence, no data loss in the conversion.

One thing to be careful of. If you are JOINing on a uuid, do make sure the CHARACTER SET and COLLATION of the column in both tables is the same. Otherwise, there will be a big performance hit. (MySQL is not smart enough to understand that you have simply hex.)

Also, check the collation -- if it is ..._ci, then "a" == "A", etc. This would be beneficial if you might need case folding. (..._bin treats a..f as different than A..F.)

In InnoDB, with either VARCHAR or CHAR, ascii or utf8mb4 with virtually any length of at least 36, works identically.

Yes, do have a backup handy, just in case.

For large tables, UUIDs have an unavoidable performance problem. I discuss it here: http://mysql.rjweb.org/doc.php/uuid

Rick James
  • 135,179
  • 13
  • 127
  • 222