Shuffling as anonymization technique for public data

Question

Shuffling has been approved as a data de-identification technique by the EU Data Protection Board in its “Opinion 05/2014 on Anonymisation Techniques”. However, there has been very little discussion of appropriate use cases and risks. Talend, Informatica, Oracle and others support various forms of data shuffling, and Fisher-Yates is a well-known shuffling algorithm.

Shuffling, similar to noise addition, may not provide full anonymisation by itself and usually is combined with other de-identification techniques.
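For concreteness, here is a minimal Python sketch of what shuffling a single sensitive column with Fisher-Yates might look like. The toy table, the field names and the `fisher_yates_shuffle` helper are hypothetical illustrations, not taken from any of the products mentioned above. It shows that the set of values in the column is unchanged, which is one reason shuffling alone may not be enough.

```python
import random

def fisher_yates_shuffle(values, rng):
    """Return a shuffled copy of `values` using the Fisher-Yates algorithm."""
    out = list(values)
    # Walk from the last index down to 1, swapping each position with a
    # uniformly chosen position at or before it.
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        out[i], out[j] = out[j], out[i]
    return out

# Hypothetical toy table: "role" acts as a quasi-identifier, "salary" is sensitive.
records = [
    {"role": "CEO", "salary": 900_000},
    {"role": "Engineer", "salary": 120_000},
    {"role": "Engineer", "salary": 115_000},
    {"role": "Analyst", "salary": 80_000},
    {"role": "Intern", "salary": 30_000},
]

# Assumption: a seeded, non-cryptographic RNG is used here only for
# reproducibility; a real release would want a secure source of randomness.
rng = random.Random(0)
original_salaries = [r["salary"] for r in records]
shuffled_salaries = fisher_yates_shuffle(original_salaries, rng)

# Re-attach the permuted salaries to the untouched quasi-identifiers.
shuffled_records = [
    {"role": r["role"], "salary": s}
    for r, s in zip(records, shuffled_salaries)
]

# The column is only permuted: every original value, including the
# maximum, is still present in the published table.
print(max(shuffled_salaries) == max(original_salaries))
```

Because the values are merely reordered, outliers such as the largest salary remain visible in the output even though they are no longer attached to the original row.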

Do examples of open public data exist where shuffling has been successfully used as a part of de-identification? Particular concerns with shuffling include which algorithm was used and how k-anonymization was applied to quasi-identifiers.

As it sounds like you already suspect, shuffling is not a good technique. It provides no information-theoretic guarantees, and the permuted sensitive values themselves can often be linked back to a specific record. For instance, the maximum salary in a company salary table likely belongs to the CEO. Randomized response provides an alternative to shuffling. It differs in that the values are allowed to change, and it can be performed in a way that provably limits the ability of an adversary to perform inference attacks. I am happy to expand this into a full answer if there is interest. - Alfred Rossi, Jun 11 '20 at 12:22
Thanks for your reply. The EU Data Protection Board acknowledged “Similarly to noise addition, permutation may not provide anonymisation by itself and should always be combined with the removal of obvious attributes/quasi-identifiers.” Any published, peer-reviewed research challenging the EDPB recommendation on shuffling would certainly be interesting. - Brad Schoening, Jun 11 '20 at 19:33
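
To make the randomized response idea from the comments concrete, here is a minimal Python sketch for a single yes/no attribute. It is one common coin-flip variant, chosen here as an assumption; the function names, the `p_truth` parameter and the simulated population are hypothetical and only for illustration. Each person reports their true value with probability `p_truth` and otherwise reports a fair coin flip, so any individual report is deniable, while the population rate can still be estimated.

```python
import random

def randomized_response(truth, p_truth, rng):
    """Report the true bit with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(reports, p_truth):
    """Invert the noise: E[observed] = p_truth * rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) / 2) / p_truth

# Hypothetical population in which 30% hold the sensitive attribute.
rng = random.Random(42)
true_values = [rng.random() < 0.3 for _ in range(20_000)]

# Each individual perturbs their own answer before it is collected.
reports = [randomized_response(v, p_truth=0.5, rng=rng) for v in true_values]

estimate = estimate_true_rate(reports, p_truth=0.5)
print(f"estimated rate of the attribute: {estimate:.3f}")
```

With `p_truth=0.5`, a "yes" report is three times as likely from someone whose true answer is "yes" as from someone whose true answer is "no" (0.75 versus 0.25), which bounds what an adversary can infer from any single report. Unlike shuffling, the published values themselves are changed.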
