1

There's business intelligence risk to exposing auto-incremented IDs to users. The methods that I commonly see people using to solve this are:

  • use GUID/MD5 (larger and slower than 4 byte INT)
  • generate integer, check for collision, repeat (slow, have to lock table)
  • expose "public" ID in the application (easy to leak actual ID, might be slow)

Instead, I created a table:

CREATE TABLE `foo` (
  `id` int unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci;

Then, every time I insert a row, I'd manually insert the ID like this:

insert into `foo` (id) values (
    last_insert_id((0xe8e5 * last_insert_id()) % power(2, 32))
);

The formula is a linear congruential generator. My understanding is that it would generate a pseudo-random integer between 1 and 2^32, with no collisions unless all 2^32-1 integers are exhausted. The parameter 0xe8e5 is from https://arxiv.org/pdf/2001.05304.pdf

Surprisingly, this even works with bulk inserts. Also, I tried inserting 100,000 rows and there weren't collisions.

I haven't seen anyone else do this before, are there potential issues with this?

Leo Jiang
  • 24,497
  • 49
  • 154
  • 284
  • 1
    Is it worked for insert multiple rows? since , `last_insert_id` returns the value generated for the **first** inserted row only. – Majid Hajibaba Jul 11 '21 at 13:24
  • Also `last_insert_id` is 0 for the first row and `last_insert_id((0xe8e5 * last_insert_id()) % power(2, 32))` always return 0. What's wrong here in my test and yours? – Majid Hajibaba Jul 11 '21 at 13:27
  • @MajidHajibaba Leo is using `last_insert_id(expr)`, which sets the value implicitly and would return a different value for each row (similar to a function that gets evaluated for each row.) `last_insert_id()` on its own would indeed not work as expected for multi-row inserts. – Solarflare Jul 11 '21 at 14:00
  • Thanks @Solarflare. Also the first value of last_insert_id() must not be zero. – Majid Hajibaba Jul 11 '21 at 15:41
  • Yes it works with bulk inserts, and you'll need to seed last_insert_id with a positive value – Leo Jiang Jul 11 '21 at 16:49
  • How much of risk is it to use auto-increment ID's? Because such risks are usually mitigated with cryptographic solutions. A LCG is not hard to solve given one or two outputs, and given the output of an LCG the position in the sequence can be determined easily for smallish LCGs like this, in which case you're no better than using an auto-incremented ID. – President James K. Polk Jul 11 '21 at 18:40
  • Does this answer your question? [How to generate a unique and random number for a primary key in sql](https://stackoverflow.com/questions/51167523/how-to-generate-a-unique-and-random-number-for-a-primary-key-in-sql) – President James K. Polk Jul 11 '21 at 18:48
  • @PresidentJamesK.Polk I just don't want to make it too easy, e.g. cryptographic would be nice, but might be diminishing returns. That solution just says to use UUID, which I mentioned in the question. – Leo Jiang Jul 11 '21 at 19:23
  • No, there were several answers, the [the last one](https://stackoverflow.com/a/56120914/238704) mentions the lcg among other things. – President James K. Polk Jul 11 '21 at 21:28
  • @PresidentJamesK.Polk - Your links are _not_ applicable to MySQL. – Rick James Jul 12 '21 at 00:02
  • Even if that code is correct... Who is going to maintain that code? – Fulldump Jul 14 '21 at 01:56

2 Answers2

5

There is a general technical issue with your specific approach, as last_insert_id() is session-specific. That means another session will not be informed about changes to that value, and will most likely reuse an id. Also you will have to initialize last_insert_id() when the session starts. Additionally, every use of an auto-increment column (for maybe a different table) will reset the value.

You should also be aware that, although it will obscure your values, this is not necessarily secure, see e.g. How can I determine the initial values of pseudo-random number generator if the sequence is given?.

Solarflare
  • 10,721
  • 2
  • 18
  • 35
1
  • I believe that MySQL's RAND() is a simple LCG with a period of 2^30.

  • Check out UUID_SHORT(). See this for how "unique" it is (or isn't): https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid-short

  • A one-way hash, like MD5() is virtually unhackable. Something like MD5(6)) select benchmark(1000000, MD5(NOW(6))); says that it takes about 0.2 microseconds on my machine. I would think that that is plenty fast. Compare that to maybe 1000 microseconds to store a row containing the value.

  • LENGTH(UNHEX(MD5(NOW(6)))) = 16 bytes, so not terribly big.

  • If you have 9 trillion md5's, there is only one chance in 9 trillion that there is an unexpected dup.

Rick James
  • 135,179
  • 13
  • 127
  • 222