7

I have been researching and reading about SQL data types for a few days now (I know... I know, that's not very long) and one of the things that is hard for me to grasp is how to pick the best data type for expandability, efficiency, and ease of access.

I think it's pretty straight forward how to choose basic Data Types (i.e. int vs varchar) but how do you choose between things like blob and text type.

The MySQL man pages are great but they are not what we computer geeks love... efficient.

I think it would be awesome if we could compile a list of the MySQL data types, the general advantages/disadvantages of each and when it would make sense to choose that data type.

austinbv
  • 9,297
  • 6
  • 50
  • 82
  • 1
    http://www.google.com/search?q=performance+of+mysql+data+types (anything on the first page) – Ben Apr 08 '11 at 00:42
  • I like the question, and hope for a well thought out answer, this could really help the speed on my servers (if I picked the wrong types). Meanwhile, this article had something to do with that (and the problems with choosing the wrong datatype) http://m.pinboard.in/blog/173/ – Jess Apr 08 '11 at 01:05
  • @Ben of course a link to google, you don't think by reading my question that I have spent some time googling and reading about what is best. The problem is there is no compilation of that information in an efficient manner with a good description of when to use each. If you read 5 or 6 articles you get a good amount of info to start but we could match that here in a wiki style list easily. It could be editable and used much better than sifting though google results. The best article that search has is peachpits and it is still incomplete. :/ I think an answer here could be much better. – austinbv Apr 10 '11 at 00:35
  • @zobgib That isn't what StackOverflow is about. See the FAQ: http://stackoverflow.com/faq – coreyward Apr 10 '11 at 00:42
  • "I think it would be awesome if we could compile a list" -- then you're in the wrong place, I'm afraid – Michael Mrozek Apr 10 '11 at 00:42
  • @corey & Michael I read over the before I asked and it seems like a reasonable question. It is a problem I face, I do not know when and where it is most efficient to use which data types. It is not open ended, it is actually really specific, and the question has an answer that the community can contribute to. If I am wrong I am sorry but this is something I really do wonder about and think could help me design better sites. – austinbv Apr 10 '11 at 00:54
  • @zobgib You're asking for us to compile documentation for you in a friendlier format. That isn't a targeted question. – coreyward Apr 10 '11 at 00:58
  • @corey isn't that what most of these questions about how to use the jQuery ".each()" function or why my css doesn't work are. If people really read the docs their questions would be answered. What I am asking for is a really neat little compilation of what people have found is best, or what is just best. I offered a bounty rather than letting it sit because this really is something that would really help me. Sorry if you don't think it's right – austinbv Apr 10 '11 at 04:59
  • @zobgib If you can't understand the difference between getting an answer to a specific question and getting the community to compile a list of best practices and guidelines, I'm afraid there isn't anything I can do to help you further. – coreyward Apr 10 '11 at 05:47

3 Answers3

12

MySQL string types come in two variants: One without a character set label and one with a character set label.

A fixed length string, padded with spaces at the end, is CHAR(n). The matching type that has no character set label is BINARY(n). Storing the string "hello" in a CHAR(255) CHARSET utf8 will take 765 bytes (string padded with spaces to full length, stored as utf8 which as a worst case space usage of 3 bytes/character allocates 3*255 bytes).

A variable length string with one or two length bytes and no padding is VARCHAR((n). The matching type that has no character set label is VARBINARY(n). Storing the string "hello" in a VARCHAR(255) CHARSET utf8 will take 6 bytes (1 length byte plus 5 bytes for the actual text). Storing the string クリス in the same type will take 10 bytes (1 length byte plus 3 characters using 3 bytes per character to represent them).

mysql> select hex('クリス'), length(hex('クリス'))/2 as bytes;
+--------------------+--------+
| hex('クリス')      | bytes  |
+--------------------+--------+
| E382AFE383AAE382B9 | 9.0000 |
+--------------------+--------+
1 row in set (0.02 sec)

A variable length string with one, two, three or four length bytes is TINYTEXT, TEXT, MEDIUMTEXT and LARGETEXT. The matching types that have no character set label are TINYBLOB, BLOB, MEDIUMBLOB and LARGEBLOB.

A TEXT/BLOB-like type differs from a VARCHAR/VARBINARY-like type in how and where the data is stored, see http://www.mysqlperformanceblog.com/2010/02/09/blob-storage-in-innodb/ for details on how TEXT/BLOB-like types are stored in InnoDB depending on version and ROW_FORMAT settings. For performance reasons, you want a recent version of InnoDB and "Barracuda"-Format tables.

MySQL is incapable of working with any data that is larger than max_allowed_packet (default: 1M) in size, unless you construct complicated and memory intense workarounds at the server side. This further restricts what can be done with TEXT/BLOB-like types, and generally makes the LARGETEXT/LARGEBLOB type useless in a default configuration.

For types without a character set label (BINARY, VARBINARY and %BLOB%), MySQL will accept the data as received and write it to disk. For types with a character set label, MySQL will look at what you announced as your clients character set to the server with SET NAMES and what is the columns defined character set label. It will then convert from the connection character set to the column character set and write the converted data. You can check that with the HEX() function, e.g. SELECT HEX(str) FROM t WHERE id = ....

On retrieval, the connections announced character set with SET NAMES can be different from what it had been at write time. MySQL again will check the columns character set label against the character set announced for this connection, and if necessary, will convert to the connection character set.

The performance penalty for that conversion either way is negligible compared to the time taken for the disk I/O incurred for such data anyway, to performancewise it hardly matters which type you choose. The rule is instead: Chose a type with a character set label, if you are working with textual data, and a type without if you are not.


A related question often asked: Shall I choose CHAR or VARCHAR (BINARY or VARBINARY respectively)?

For InnoDB, the answer is always: Chose the variable length data type. There is no performance benefit from fixed length data types in InnoDB, ever, but there is a huge size penalty if you are choosing a fixed length data type and then are not using all space in it. Plus fixed length SQL string types have really weird rules regarding padding and trimming with spaces at the end which you probably can't be bothered to learn. For MySQL the case can be different, but almost never is.


Another related question: Shall I choose VARCHAR or TEXT for my strings (VARBINARY or BLOB, respectively)?

The answer for that is using a recent version of InnoDB, Barracuda format tables and then TEXT/BLOB. The reason for that is explained at length in http://www.mysqlperformanceblog.com/2011/04/07/innodb-row-size-limitation/. The upshot of that is: With either VARCHAR or TEXT/BLOB in pre-Barracuda format you run the risk of overflowing the InnoDB row size limitation if you have too many of them in a single row.


And finally: Shall I store files/images/other large blob or text data in the database?

The answer for that is: Usually not. Serving files from a database (http://mysqldump.azundris.com/archives/36-Serving-Images-From-A-Database.html) is an expensive operation compared to serving files from a filesystem. If at all possible, you'd want to do that instead. There is a way around that, http://www.blobstreaming.org/, but that is advanced technology that requires that you have total control of your execution environment, which is never the case in a hosted environment.


To round this off: There are no variable length data types in MEMORY engine tables. So if you are seeing "using temporary" in an EXPLAIN output, this means

  • VARCHAR is converted to CHAR in that temporary table
  • VARBINARY is converted to BINARY

If the temporary table by this process becomes larger than tmp_table_size OR max_heap_table_size, it is on the fly converted to MyISAM format and goes to disk.

Example: You are defining a Ruby Active Record class User with contains ten fields tagged as :string. Each of these end up being VARCHAR(255) CHARSET utf8 in your Users table.

Elsewhere in your codebase, Users is being used in a way that involves a plan using temporary. You are instantly dying in disk operations under load, because each row of the Users table is now using 7650 bytes in MEMORY at least, most of them being spaces used as padding. This forces the temporary table to be converted to MyISAM and written to disk.

  • any %TEXT% or %BLOB% type cannot be represented in MEMORY, so the temporary table goes to disk as MyISAM even if it would have been sufficiently small to be kept in memory according to the limits stated above.

This implies that any query with a TEXT or BLOB type and a plan with "using temporary" needs to be rewritten in order to avoid temp tables hitting the disk.

Isotopp
  • 3,313
  • 1
  • 16
  • 17
1

For MySQL there is a procedure called analyze that will assess data heuristics with the idea that it informs the best choice for a data type and will suggest a range or values for enumeration.

A quick dynamic concat script to generate the SQL to run

select CONCAT(' SELECT ', COLUMN_NAME, ' FROM ', TABLE_NAME, ' procedure analyse() ;' ) 
FROM INFORMATION_SCHEMA.COLUMNS   
WHERE table_schema ="yourDbName"   
AND DATA_TYPE ="varchar"   
AND CHARACTER_MAXIMUM_LENGTH > 190   
AND COLUMN_KEY not in (' ') ;

** SQL above does not evaluate PKs -- assuming they are not text fields

The procedure is useful when looking to change a datatype based on data usage or to gain more efficiencies by moving or storing a smaller data packet.

Percona Blog has a good working example of procedure analyze that applies to Drupal. https://www.percona.com/blog/2009/03/23/procedure-analyse/

Some of that research is done for compression which ties into longer utf8mb4 indexes http://techblog.constantcontact.com/devops/space-the-final-frontier-a-story-of-mysql-compression/

1

With regards to BLOB vs TEXT (since that's the only concrete question in your post): BLOB is for binary data, and TEXT is intended for text data.

It's usually pretty simple to use the most specific type of column that fits your needs, and fallback to the generic types if none of them fit your usage.

coreyward
  • 77,547
  • 20
  • 137
  • 166