I am trying to anonymize the production data with human readable replacements - this will not only mask the actual data but also will give it a callable identity for recognition. Please help me on how to anonymize the dataframe columns like firstname, lastname, fullname with other pronunciable english words in Scala:
- It must convert a real world name into another real world name which is pronounceable and identifiable.
- It must be possible to convert first name, last name and full name separately, such that full name = first name and last name separated by a space.
- It should produce the same anomynized name for a name on every iteration.
- The target dataset will have more than a million distinct records.
I have tried iterating over a dictionary of nouns and adjectives to reach a combination of two pronunciable words but it is not going to give me a million distinct combinations. Code below:
def anonymizeString(s: Option[String]): Option[String] = {
val AsciiUpperLetters = ('A' to 'Z').toList.filter(_.isLetter)
val AsciiLowerLetters = ('a' to 'z').toList.filter(_.isLetter)
val UtfLetters = (128.toChar to 256.toChar).toList.filter(_.isLetter)
val Numbers = ('0' to '9')
s match {
//case None => None
case _ =>
val seed = scala.util.hashing.MurmurHash3.stringHash(s.get).abs
val random = new scala.util.Random(seed)
var r = ""
for (c <- s.get) {
if (Numbers.contains(c)) {
r = r + (((random.nextInt.abs + c) % Numbers.size))
} else if (AsciiUpperLetters.contains(c)) {
r = r + AsciiUpperLetters(((random.nextInt.abs) % AsciiUpperLetters.size))
} else if (AsciiLowerLetters.contains(c)) {
r = r + AsciiLowerLetters(((random.nextInt.abs) % AsciiLowerLetters.size))
} else if (UtfLetters.contains(c)) {
r = r + UtfLetters(((random.nextInt.abs) % UtfLetters.size))
} else {
r = r + c
}
}
Some(r)
}