1

What would be a good/safe maximum length for a VARCHAR column as primary key not being much/any slower than an INTEGER ID using MySQL 5 + InnoDB on a 64 bit system? Note, that this PK should be assumed to be referenced by other tables, so it will appear in a number of JOINs.

Would a VARCHAR(7) be a good length? 6? 8? 10? More? Less? Why?

It might be hard to answer, but there should at least be an upper limit based on facts, e.g. based on the inner workings of MySQL/InnoDB (index structures, ... ?).

Edit: Assume the ASCII character encoding, case sensitive.

Kawu
  • 13,647
  • 34
  • 123
  • 195

3 Answers3

3

Using an int field is generally preferable for keys. Two numbers can be compared in a single assembler instruction on pretty much any platform in existence. Comparing two strings will invariably require a loop and extra setup steps in advance, or take multiple cpu cycles even if the cpu has string comparison instructions built in.

Marc B
  • 356,200
  • 43
  • 426
  • 500
  • 1
    That was not the question! The question was which size of VARCHAR would probably be as efficient as an INT, not if an INT is preferrable or such. That's more of a design issue. – Kawu Aug 09 '11 at 21:30
  • 1
    That is the answer - text keys are strings which have extra overhead for comparisons. Any string-based key is going to be slower than a simple integer for pretty much any operation you can do in a database. The longer the string, the larger the overhead. Do some benchmarking with various lengths and see where the overhead starts costing you too much. – Marc B Aug 09 '11 at 21:33
  • I have doubts that a VARCHAR(4) or something like that is much slower than an INT. I'd rather think it'll depend on the character encoding. But that's still not the question. I'm looking for people who have done such benchmarks and would share what they have found concerning the maximum size of a VARCHAR that's basically not noticably slower, even on a large database. Some claim you won't notice the difference at all, but I've used one that was a VARCHAR(50) IIRC and it **was** noticably slower. Now I want to determine a kind of maximum size that would probably work most of the time. – Kawu Aug 09 '11 at 21:39
  • varchar(4) isn't necessarily 4 bytes, though. it could be something like 12 or 16 bytes (or more) if it's utf-8 or utf-16 and using obscure/uncommon characters plus diacritics. This is part of the string overhead I'm talking about. – Marc B Aug 09 '11 at 21:41
  • Assume non-UTF, simple ASCII, case sensitive. – Kawu Aug 09 '11 at 21:43
  • Then it depends on the database being smart enough to realize than 4 8bit ascii chars can be loaded into 32bit registers (or 8 chars into a 64bit register) and directly tested for equality, which comes down to the compiler and original coder recognizing such a case and explicitly handling it without resorting to string comparisons. – Marc B Aug 09 '11 at 21:51
  • Unless you want to dig through mysql's key handling code, the only way to determine if this is the case is to run a long series of benchmarks. – Marc B Aug 09 '11 at 21:52
  • `ascii_bin` is at least supposed to act this way, but I can't really find any confirmation. – Mchl Aug 09 '11 at 21:55
  • This is what my question was directed to: InnoDB and internal MySQL structures. I really wonder how MySQL primary key indexes affect the performance of this. – Kawu Aug 09 '11 at 21:56
2

Anything larger than VARCHAR(4) in ASCII encoding (and ascii_bin collation - you don't want case insensitive collation for relational operations) will be slower than INT (because INT is 4 bytes long)

Mchl
  • 61,444
  • 9
  • 118
  • 120
  • Even on a 64bit system (as asked)? I have doubts with this naive view on the topic. – Kawu Aug 09 '11 at 21:29
  • Yes, an INT in MySQL is always 4 bytes, even on a 64 bit system. – Ike Walker Aug 09 '11 at 21:36
  • INT is 4B regardless if system is 32 or 64b, and being 64b will not make such operations any faster than on 32b platform. Also Marc B has pretty good arguments in his answer. – Mchl Aug 09 '11 at 21:37
  • OK, but shouldn't a 64 bit system be able to handle a case sensitive, ASCII encoded VARCHAR(x < 8) nearly equally as fast? I still have doubts that an INT is always faster, also given that so many people claim VARCHAR to be not noticably "slower". – Kawu Aug 09 '11 at 21:47
  • I remember Assembler/C times where the use of 16 bit types on a 32 bit system was actually **slower** than using a 32 bit type. I don't know if this still holds true for today's processors, just to give you another thought of mine. – Kawu Aug 09 '11 at 22:02
  • I did a very simple benchmark with SELECT on 2M rows table with INT vs CHAR(32) PK. No significant difference. But again, it was a simple SELECT, with no joins, no range queries etc. Also insert times are significantly higher with CHAR(32) PK (Whicha again is kind of expected, since I used MD5 hashes, so inserts were not sequential, and InnoDB's clustered index doesn't like that) – Mchl Aug 09 '11 at 22:56
  • `CHAR(32)` is unfair, because that's (at least) 32 bytes. But `CHAR(4)` with `ascii_bin` collation... I'm truly curious if MySQL will compare that as a single binary comparison the same was as `INT`. – Timo Sep 03 '18 at 15:37
2

What would be a good/safe maximum length for a VARCHAR column as primary key not being much/any slower than an INTEGER ID using MySQL 5 + InnoDB on a 64 bit system? Note, that this PK should be assumed to be referenced by other tables, so it will appear in a number of JOINs.

Not much slower to do what? SELECT? UPDATE? INSERT? My internal users want faster inserts; my web users want faster selects.

In part, performance (whatever that means) depends on your particular database structure, your particular query patterns, and your particular server hardware. What did your own tests show you?

It might be hard to answer, but there should at least be an upper limit based on facts, e.g. based on the inner workings of MySQL/InnoDB (index structures, ... ?).

If there were an upper limit, you wouldn't be able to count on it remaining unchanged across even minor version upgrades. And, of course, the query optimizer makes decisions at run time, not at design time. Basing long-term database design decisions on dbms internals as they exist today is not a Best Practice. (That's just an observation. I'm not implying that's what you're doing, but a lot of people who read this are liable to do just that.)

If you want to know how a particular set of tables perform, it makes sense to edit your question and include the DDL. That way we're at least talking about the same thing. As it is now, everybody that answers is probably going to be using a different structure. (If they bother to test at all.) And we might not reveal our private--and sometime unwarranted--assumptions.

In one specific case--from another SO question--using id numbers and joins took 100 times as long to execute as using a natural key. (32-bit PostgreSQL.) So people can talk all day about how many CPU instructions it takes to compare integers or strings, or the number of bytes in an integer, or the number of bytes in UTF-8 collations, or whatever. Nevertheless, in that specific case, VARCHAR(30) won by a landslide.

When I was in the military, we had a saying. "When your map and the terrain disagree, follow the terrain."

If theory and measurements disagree, follow the measurements. Develop rules of thumb from measurements.

Community
  • 1
  • 1
Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185