
I often use production data in my dev environment for testing. However, because some of it is sensitive, I need to anonymise it. I have identified the sensitive fields, such as name and address. For the name field, for example, I am planning to run an update that sets each value to a random one. Is this an effective way of anonymising the data? Any ideas?

refresh
  • As I understand it, it may not be very effective at all. Also see the [AOL search data leak](https://en.wikipedia.org/wiki/AOL_search_data_leak). – jww Jun 06 '17 at 10:16
  • @jww : Can you explain why please? – refresh Jun 06 '17 at 10:53

2 Answers

1

There are a couple of other tools available, but most of them are very expensive. We are also currently searching for a solution.

If you don't want to use a commercial ETL tool, IMHO a good solution might be to use NiFi and add some hash functions. You will find NiFi here: https://nifi.apache.org/

But this will only work for you if you are able to write your own plugin, e.g. in Java. While searching, I found this PDF document very helpful for understanding what is really necessary: http://www.odbms.org/wp-content/uploads/2014/03/The-Complete-Book-of-Data-Anonymization_Chap_1.pdf
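As a rough illustration of the hashing idea, independent of NiFi, here is a minimal Python sketch. It uses a keyed hash (HMAC-SHA256) rather than a plain hash, which is a choice made for this example: names and addresses are easy to guess, so an unkeyed hash can be reversed with a dictionary of likely values. The key and the function name `pseudonymise` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret for this sketch; it should never be stored in the dev environment.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymise(value: str, key: bytes = SECRET_KEY, length: int = 16) -> str:
    """Return a stable token for a sensitive value using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate only for readability; shorter tokens collide more easily.
    return digest[:length]


if __name__ == "__main__":
    print(pseudonymise("Alice Example"))
    print(pseudonymise("Alice Example"))  # same token as the line above
    print(pseudonymise("Bob Example"))    # a different token
```

Because the same input always gives the same token, joins between tables keep working. That also means this is pseudonymisation rather than full anonymisation: anyone holding the key can recompute the mapping.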

Another possibility is to write your own small ETL tool, e.g. in Python or just SQL (the latter will only work when the source and target connections are the same). If you are not a developer, NiFi is probably not going to help you, and neither is writing your own tool. In that case you should look at a commercial solution, but for dev and testing I would in most cases recommend keeping it small and simple rather than buying a big, expensive product like Informatica or Ab Initio. Maybe these tools / links will help:

https://anno.io/
https://www.owasp.org/index.php/Anonymization
https://open-ls.de/en/anonymization-knoxxer/

If you are going to build your own solution, it might be fine to just use, for example, a shell script that calls SQL*Plus for Oracle. But it really depends on what you are trying to do.
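To give an idea of how small such a home-grown tool can be, here is a sketch in Python. It uses the standard-library `sqlite3` module only so that it runs anywhere; with Oracle you would use a different driver. The table `customers`, its columns, and the helper names are all made up for the example.

```python
import random
import sqlite3
import string


def random_text(rng: random.Random, length: int = 10) -> str:
    """Return a random lowercase string of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def anonymise_columns(conn, table, key_column, columns, seed=None):
    """Overwrite the given columns of every row with random text.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted configuration, never from user input.
    """
    rng = random.Random(seed)
    # Read all keys first so the updates do not interfere with the SELECT.
    keys = [row[0] for row in conn.execute(f'SELECT "{key_column}" FROM "{table}"')]
    assignments = ", ".join(f'"{column}" = ?' for column in columns)
    for key in keys:
        values = [random_text(rng) for _ in columns]
        conn.execute(
            f'UPDATE "{table}" SET {assignments} WHERE "{key_column}" = ?',
            (*values, key),
        )
    conn.commit()
    return len(keys)


if __name__ == "__main__":
    # Hypothetical schema and data, standing in for a copy of production.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT, plan TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (name, address, plan) VALUES (?, ?, ?)",
        [("Alice Example", "1 Main Street", "gold"), ("Bob Example", "2 Side Road", "free")],
    )
    anonymise_columns(conn, "customers", "id", ["name", "address"])
    for row in conn.execute("SELECT id, name, address, plan FROM customers"):
        print(row)
```

Each row gets its own random value, so this should be run on the copy that is loaded into dev, not on the production database itself.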

-1

You could do this manually, as you suggest, by replacing personal information with random strings. Even better, if you want the values to stay valid-looking, libraries like Faker for Python can help. If you do this with any regularity, though, hardcoded solutions will end up breaking whenever the schema changes.
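A minimal sketch of the "plausible values" idea follows. To stay dependency-free it uses small made-up word lists instead of Faker, which offers far richer generators. The column names and the `GENERATORS` mapping are assumptions for illustration; keeping that mapping in one place means a schema change only requires a new entry rather than new hardcoded logic.

```python
import random

# Hypothetical sample values; a library such as Faker would supply far more variety.
FIRST_NAMES = ["Alex", "Sam", "Jo", "Kim", "Robin", "Chris"]
LAST_NAMES = ["Smith", "Jones", "Brown", "Taylor", "Wilson", "Evans"]
STREETS = ["High Street", "Station Road", "Mill Lane", "Church Road"]


def fake_name(rng: random.Random) -> str:
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


def fake_address(rng: random.Random) -> str:
    return f"{rng.randint(1, 199)} {rng.choice(STREETS)}"


# One entry per sensitive column; columns not listed here are left untouched.
GENERATORS = {"name": fake_name, "address": fake_address}


def anonymise_row(row: dict, rng: random.Random) -> dict:
    """Return a copy of the row with sensitive columns replaced by fake values."""
    return {
        column: GENERATORS[column](rng) if column in GENERATORS else value
        for column, value in row.items()
    }


if __name__ == "__main__":
    rng = random.Random()
    original = {"id": 1, "name": "Real Person", "address": "1 Real Street", "plan": "gold"}
    print(anonymise_row(original, rng))
```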

There’s also a body of mathematical theory on the best ways to anonymise a dataset, and there are plenty of examples of sensitive data being linked back to an individual. This is often because the dataset wasn’t properly anonymised or because it was combined with publicly available data. However, it's definitely safer to work with anonymised data in test.
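One well-known idea from that theory is k-anonymity: every combination of "quasi-identifier" values (fields that are not names but can still single someone out, such as postcode and birth year) should be shared by at least k rows. The sketch below only checks that property; the column names and sample rows are invented, and passing the check does not by itself prove a dataset is safe.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[column] for column in quasi_identifiers) for row in rows)


def smallest_group(rows, quasi_identifiers):
    """Return the size of the smallest group, i.e. the k the data actually achieves."""
    sizes = group_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_rows(rows, quasi_identifiers, k):
    """Return the rows whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[column] for column in quasi_identifiers)] < k
    ]


if __name__ == "__main__":
    # Invented sample data: names are already randomised, other fields are not.
    rows = [
        {"name": "xkqpd", "postcode": "AB1", "birth_year": 1980},
        {"name": "mwzrt", "postcode": "AB1", "birth_year": 1980},
        {"name": "qhvnc", "postcode": "ZZ9", "birth_year": 1931},
    ]
    print(smallest_group(rows, ["postcode", "birth_year"]))
    print(risky_rows(rows, ["postcode", "birth_year"], k=2))
```

In the sample data the third row is the only one with its postcode and birth year, so it could be matched against public data even though the name has been replaced.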

d4nyll
rimeice