0

I have a DEV DB with 16 million(ish) records. I need to 'mask' columns of personal data (name, address, phone, etc.). I found a nice function that will do the data masking wonderfully Howto generate meaningful test data using a MySQL function.

The problem is, when I call the function, it is only processing about 30 records per second. This is way to slow.

Is there anyway to speed this up. Maybe create a temp table or something.

Here is the UPDATE statement that calls the function.

    UPDATE table1 
        SET first_name = (str_random('Cc{3}c(4)')),
        last_name = (str_random('Cc{5}c(6)')),
        email = (str_random('c{3}c(5)[.|_]c{8}c(8)@[google|yahoo|live|mail]".com"')),
        address1 = (str_random('d{3}d{1} Cc{5} [Street|Lane|Road|Park]')),
        city = (str_random('Cc{5}c(6)')),
        state = (str_random('C{2}')),
        zip = (str_random('d{5}-d{4}'))

Thanks!!

Hadi
  • 36,233
  • 13
  • 65
  • 124
GreetRufus
  • 421
  • 2
  • 9
  • 19
  • Anything random is inherently slow in MySQL. There's not much you can do to speed that up as far as I know. – ydaetskcoR Aug 28 '14 at 19:13
  • Maybe generate the items procedurally instead of randomly? – Shadow Radiance Aug 28 '14 at 19:18
  • yes Shadow, that is correct. – GreetRufus Aug 28 '14 at 19:18
  • Sorry (edited my comment while you were answering it)... for example rather than a random value based on a formatting string - use an algo that generates the same gobbledygook string based on the id?... use hashing and then grab bits from the hashed string into your formatter? – Shadow Radiance Aug 28 '14 at 19:21
  • I understand the concept but would have no ideas how to do that in MYSQL – GreetRufus Aug 28 '14 at 19:23
  • Hashing (while faster) is more likely to hit duplicates due to the nature of your inputs - on the other hand - you know you'll always generate the same values for a given row (which might be nice to have? – Shadow Radiance Aug 28 '14 at 19:23

1 Answers1

0

Instead of calling a random function 7*16m times, it would probably be faster if you operated on procedurally generated text.

I checked out the str_random function you linked to. (That's very clever btw - cool stuff)

It calls RAND() once for each random character in the string and once each time you say "choose from list". That's a lot of rands.

I think one way to improve it would be to create and cache (in a table) a large set of random characters and instead of calling rand (say) 5 times for 5 random characters, call it once to determine an offset into the big string of random crap, then just increment the index it uses to pull from the string... (if it needs a bunch in a row - it can just pull them all at once in a row and multi-increment the offset)

The str_random_character function that the parent function calls could be replaced by something that does this instead of calling rand into an array.

It's a bit beyond me for a throwaway piece of code, but it might put you (or a better mysql guru) on a path for speeding this puppy up (maybe).


A different option would be rather than random-masking all the data... can you transform the data in some way? Since you don't need the original back, you could do something like a caesar cipher on each character in their data based on a (single) rand call for the rotation count. (If you rotate the uppers, lowers, and digits in each string separately, the data will stay looking "normal" despite not being easily reversible because of the randomized rotation) -- I wouldn't slap a SECURE sticker on it but it would be a lot quicker and not easy to reverse.

I think I have a Caesar rotator that does that somewhere if it suffices.

Shadow Radiance
  • 1,349
  • 13
  • 20
  • I would be interested in seeing the Caesar cipher. The update has already been running for 23 hours.....ugh....... – GreetRufus Aug 29 '14 at 12:52