3

What's the most efficient method or tool to randomize a list of database table columns to obscure sensitive information?

I have a Django application used by several clients, and I need to onboard some development contractors to do work on the application. When they work on bugs (e.g. page /admin/model/123 has an error), ideally they'd need a snapshot of the client database in order to reproduce and fix the bug. However, because they're off-site contractors, I'd like to mitigate risk in the event they expose the client database (unintentionally or otherwise). I don't want to have to explain to a client why all their data's been published online because a foreign contractor left his laptop in an unlocked car.

To do this, I'd like to find or write a tool to "randomize" sensitive fields in the database, like usernames, email addresses, account numbers, company names, phone numbers, etc so that the structure of the data is maintained, but all personally identifiable information is removed.

Presumably, this is a task that many other people have had to do, but I'm not sure what the technical term is, so I'm not finding much through Google. Are there any existing tools to do this with a Django application running a MySQL or PostgreSQL backend?

Cerin
  • 60,957
  • 96
  • 316
  • 522

2 Answers2

1

Anonymize and sanitize are good words for this chore.

It's relatively easy to do. Use queries like

 UPDATE person
    SET name = CONCAT('Person', person_id),
        email = CONCAT('Person', person_id, '@example.com')

and so forth, to stomp actual names and emails and all that. It's helpful to preserve the uniqueness of entries, and the autoincrementing IDs of various tables can help you do that.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • This might work for simple fields, but this won't work for more complicated fields like address, phone numbers, or account numbers that need to match a specific format but yet still be globally unique. – Cerin Jan 16 '17 at 17:17
1

(Adding this as an answer, as I am not allowed to comment yet.)

As Cerin said, O. Jones approach for anonymizing/sanitizing works for simple fields, but not more complicated ones like addresses, phone number or account numbers that need to match a specific format. However, the method can be modified to allow this too.

Let's take a phone number with format aaa-bbbb-ccc as an example and use the autoincrementing person_id as the source of unique numbers. For the ccc part of the phone number, use MOD(person_id,1000). This will give the remainder of person_id divided by a 1000. For bbbb, take MOD((person_id-MOD(person_id,1000))/1000,10000). It looks complicated, but what this is doing is taking person_id, removing the last three digits (which were used for ccc), then dividing by a 1000. The last four digits are taken from the resulting number to use as bbbb. I think you'll be able to figure out how to calculate aaa.

The three parts of the phone number can then be concatenated to give the complete phone number: CONCAT(aaa,"-",bbbb,"-",ccc)

(You might have to explicitly convert the numbers to string, I'm not sure)