technique to obfuscate clustered data and preserve privacy in r

Question

background

i have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. under no circumstances can this information be released.

as is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. i can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. that is also unacceptable.

to help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data.

i understand that if i want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; i just want to make that guessing game less precise. on the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
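to make the risk concrete, here is a minimal sketch of that attack on made-up data. everything in it is an assumption for illustration, not the real survey: four hypothetical clusters, a delete-one-cluster jackknife as the replication scheme, a base weight released alongside the replicate weights, and the names `cluster`, `base_weight` and `rep_weights`.

```r
# toy data: 12 respondents in 4 hypothetical clusters
# (the cluster vector stands in for the confidential geography)
set.seed(1)
n_clusters <- 4
cluster <- rep(seq_len(n_clusters), each = 3)
base_weight <- runif(length(cluster), 100, 200)

# assumed replication scheme: delete-one-cluster jackknife.
# replicate j zeroes out cluster j and scales everyone else up
rep_factor <- sapply(seq_len(n_clusters), function(j) {
  ifelse(cluster == j, 0, n_clusters / (n_clusters - 1))
})
rep_weights <- base_weight * rep_factor
colnames(rep_weights) <- paste0("rw", seq_len(n_clusters))

# the attack: divide each replicate weight by the base weight.
# respondents with the same pattern of factors share a cluster
pattern <- apply(round(rep_weights / base_weight, 6), 1, paste, collapse = "|")
guessed_cluster <- as.integer(factor(pattern, levels = unique(pattern)))

# TRUE when every respondent has been placed in the right group
all(guessed_cluster == cluster)
```

in this toy case the last line is `TRUE`: the attacker never learns where a cluster is, but recovers exactly who shares one.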

request

i am looking for a technique that

prevents the public use file users from easily deducing the shared geographic location from the correlations between my replicate weights variables
does not obliterate the correlations between my columns of data (the replicate weights variables)
can be implemented on an R data.frame object without a major time investment

i say shared because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.

what i have tried

i don't really want to re-invent the wheel here. i am looking for r syntax, an r package, or anything else that would be relatively straightforward to implement. i've found four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.

i can do simple things like add and subtract random values to my replicate weights columns according to a normal distribution, but i'd prefer to rely on the work of someone who understands privacy issues better than i do.
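for reference, the simple noise idea might look like the sketch below. it is only a toy under stated assumptions (the same made-up delete-one-cluster jackknife weights, and a hypothetical `noise_sd` that is not a recommended value), and it is not a vetted disclosure-control method.

```r
set.seed(2)

# hypothetical toy setup: 4 clusters of 3 respondents with
# delete-one-cluster jackknife replicate weights (an assumption)
n_clusters <- 4
cluster <- rep(seq_len(n_clusters), each = 3)
base_weight <- runif(length(cluster), 100, 200)
rep_weights <- sapply(seq_len(n_clusters), function(j) {
  base_weight * ifelse(cluster == j, 0, n_clusters / (n_clusters - 1))
})

# add independent normal noise to every cell; noise_sd is a
# hypothetical tuning knob, not a recommended value
noise_sd <- 5
noise <- matrix(rnorm(length(rep_weights), mean = 0, sd = noise_sd),
                nrow = nrow(rep_weights))
noisy_weights <- pmax(rep_weights + noise, 0)  # keep weights non-negative

# exact matching on the pattern of factors no longer groups anyone
pattern <- apply(round(noisy_weights / base_weight, 6), 1, paste, collapse = "|")
exact_match_defeated <- !any(duplicated(pattern))

# each noisy column still tracks its original column closely
column_cor <- diag(cor(rep_weights, noisy_weights))

# but an attacker who looks for the smallest column in each row
# still finds the deleted cluster in this toy design
smarter_guess <- apply(noisy_weights, 1, which.min)
still_recovered <- all(smarter_guess == cluster)
```

in this toy design the noise stops exact matching and keeps the columns strongly correlated with the originals, yet the smallest-column check still recovers every cluster. that gap is the reason for preferring a method designed by someone who works on disclosure control.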

thanks!!!!

You cannot. More than one data scientist/software guru has shown it's easy to extract personal identification from allegedly anonymized big data clumps. Your choice is either, as you noted, to leave a path for someone to reconstruct the geodata, or to remove the geodata entirely and do your analysis based on some other factor. — Carl Witthoft, Jun 13 '14 at 11:42
the united states census bureau regularly does what i am describing, despite their own strict confidentiality rules. let's lower the bar and say, "if it's good enough for census, it's good enough for me." i am hereby defining a new term: WWCD? thanks — Anthony Damico, Jun 13 '14 at 13:24
thanks @James i had never heard of that before! i spent some time trying to answer my own question with that toolkit. :) — Anthony Damico, Jun 15 '14 at 11:06

Anthony Damico · Accepted Answer · 2014-09-16T07:04:59.143


i have written this nine-step tutorial to walk through the process in an attempt to answer my own question. i am not an expert in the field of privacy/confidentiality and would love to hear both feedback about this idea and also other ideas. thanks!

http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html

edited Sep 16 '14 at 07:04

answered Jun 15 '14 at 10:38

Anthony Damico


the link is dead :-( – Dan Chaltiel Sep 01 '22 at 09:45
whoops, apologies.. blog post: http://usgsd.blogspot.com/2014/09/how-to-provide-variance-calculation-on.html and code: https://github.com/ajdamico/asdfree/tree/archive/Confidentiality – Anthony Damico Sep 02 '22 at 10:48
