background

i have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. under no circumstances can this information be released.

as is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. i can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. that is also unacceptable.

to help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data.

i understand that if i want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; i just want to make that guessing game less precise. on the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
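to make the risk concrete, here is a minimal sketch on simulated data. it assumes delete-one-cluster jackknife replicate weights and a public file that also carries the full-sample weight; the post does not say which replication method is actually used, and every name here (`cluster`, `base_w`, `rep_w`) is hypothetical. it shows how matching adjustment patterns, rather than correlations as such, can be enough to regroup respondents.

```r
set.seed(1)

# hypothetical toy data: 12 respondents in 4 confidential clusters
cluster <- rep(1:4, each = 3)
base_w  <- runif(length(cluster), 50, 150)

# delete-one-cluster jackknife replicate weights, one column per cluster:
# the dropped cluster gets weight zero, everyone else is scaled up
n_clusters <- length(unique(cluster))
rep_w <- sapply(seq_len(n_clusters), function(r) {
  ifelse(cluster == r, 0, base_w * n_clusters / (n_clusters - 1))
})

# the attack: divide out the base weight, then group respondents
# whose replicate adjustment factors follow the same pattern
pattern <- apply(round(rep_w / base_w, 6), 1, paste, collapse = "|")
guessed <- as.integer(factor(pattern, levels = unique(pattern)))

# TRUE when every respondent lands in the correct group
all(guessed == cluster)
```

in this toy setup the grouping is recovered exactly, which is the situation any obfuscation has to make less precise.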

request

i am looking for a technique that

  • prevents public use file users from easily deducing shared geographic locations from the correlations between my replicate weights variables
  • does not obliterate the correlations between my columns of data (the replicate weights variables)
  • can be implemented on an R data.frame object without a major time investment

i say shared because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.

what i have tried

i don't really want to re-invent the wheel here. i am looking for r syntax, an r package, or anything else that would be relatively straightforward to implement. i've found one, two, three, four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.

i can do simple things like add and subtract random values to my replicate weights columns according to a normal distribution, but i'd prefer to rely on the work of someone who understands privacy issues better than i do.
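for reference, the simple approach mentioned above might look roughly like the sketch below. it is base R only, run on simulated columns; `add_normal_noise()` and its `sd_frac` argument are hypothetical names made up for illustration, and nothing here is a vetted disclosure-control method.

```r
set.seed(42)

# hypothetical stand-in for the real file: three strongly correlated columns
n <- 200
shared <- rnorm(n, mean = 100, sd = 20)
weights <- data.frame(
  repw1 = shared + rnorm(n, sd = 5),
  repw2 = shared + rnorm(n, sd = 5),
  repw3 = shared + rnorm(n, sd = 5)
)

# add zero-mean normal noise to each column; sd_frac (an illustrative knob)
# sets the noise sd as a fraction of that column's own sd
add_normal_noise <- function(df, cols, sd_frac = 0.05) {
  for (col in cols) {
    noise <- rnorm(nrow(df), mean = 0, sd = sd_frac * sd(df[[col]]))
    df[[col]] <- df[[col]] + noise
  }
  df
}

noisy <- add_normal_noise(weights, names(weights))

# compare the correlation structure before and after the perturbation
round(cor(weights), 3)
round(cor(noisy), 3)
```

a larger `sd_frac` blurs the columns more but also distorts any variance computed from them, and additive noise can push small weights below zero, so the right setting (if one exists) is exactly the judgment call i'd rather borrow from someone else's work.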

thanks!!!!

Anthony Damico
  • Try looking at the `sdcMicro` package – James Jun 13 '14 at 10:08
  • You cannot. More than one data scientist/software guru has shown it's easy to extract personal identification from allegedly anonymized big data clumps. Your choice is either, as you noted, to leave a path for someone to reconstruct the geodata, or to remove the geodata entirely and do your analysis based on some other factor. – Carl Witthoft Jun 13 '14 at 11:42
  • the united states census bureau regularly does what i am describing, despite their own strict confidentiality rules. let's lower the bar and say, "if it's good enough for census, it's good enough for me." i am hereby defining a new term: WWCD? thanks – Anthony Damico Jun 13 '14 at 13:24
  • thanks @James i had never heard of that before! i spent some time trying to answer my own question with that toolkit. :) – Anthony Damico Jun 15 '14 at 11:06

1 Answer

i have written this nine-step tutorial to walk through the process in an attempt to answer my own question. i am not an expert in the field of privacy/confidentiality and would love to hear both feedback about this idea and also other ideas. thanks!

http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html

Anthony Damico