I have two data sets that I need to link together: I have to find the records that appear in both data sets, allowing for a certain margin of error (for example, a person's first name is misspelled in one of the sets, or a person moved, or married and thereby got a different surname, etc.).

Since the data is sensitive, it should be anonymized. However, I cannot use standard anonymization techniques (hashing for example), since that wouldn't preserve some properties vital to linking records.
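
To illustrate the problem with hashing, here is a minimal sketch. The example names, the helper functions `levenshtein` and `sha256_hex`, and the choice of SHA-256 are assumptions made purely for illustration; they are not part of my actual data or pipeline.

```python
import hashlib


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sha256_hex(s: str) -> str:
    """Hex digest of the UTF-8 encoded string (illustrative anonymization step)."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


# Two spellings of the same (made-up) first name, one letter apart.
plain_distance = levenshtein("Jonathan", "Johnathan")
hashed_distance = levenshtein(sha256_hex("Jonathan"), sha256_hex("Johnathan"))

print("distance between plain names: ", plain_distance)
print("distance between hashed names:", hashed_distance)
```

The plain names are a single edit apart, but their digests look unrelated, so the hashed values say nothing about how similar the original strings were.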

Therefore, I am looking for a way to anonymize my textual data that preserves, for example, the Levenshtein distance between values. Do such techniques exist?

konewka
  • I am not sure about this specific use case, but this class of problem is the motivation for [homomorphic encryption](https://en.wikipedia.org/wiki/Homomorphic_encryption). You might want to look in that direction. – morsecodist Feb 19 '18 at 09:34
  • I was indeed thinking in the direction of homomorphic encryption, but do such methods also exist for textual data? – konewka Feb 19 '18 at 09:35
  • After taking a glance at some papers, it seems like some researchers have had some success. [Here](http://ieeexplore.ieee.org/abstract/document/5711449/?reload=true) is a paper on it. – morsecodist Feb 19 '18 at 09:41

0 Answers