I'm trying to create a Shiny app that lets a user select columns to encrypt, where a given value should always map to the same output across runs as long as the data is unchanged. For example, if customer name = "John" you always get "A" when running this process; if the customer name changes to "Jon" you could get "C", but if it is changed back to "John" you would get "A" again. This is going to be used to mask sensitive data for analysis.
Additionally, if anyone could suggest a method to 'decrypt' these columns by storing a key to be used later, that would be appreciated.
A simplified version of how I'm attempting to accomplish this (requires the digest library):
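To make the 'key' idea concrete, here is a minimal sketch of what I have in mind, using only base R. The names make_key, mask, and unmask are hypothetical, and the sequential tokens are a stand-in assumption rather than real encryption. The key table would have to be saved (for example with saveRDS) and reused on later runs for the tokens to stay stable.

```r
# Hypothetical helper: map each distinct value to a token and keep the mapping.
make_key <- function(values) {
  original <- unique(as.character(values))
  data.frame(
    original = original,
    token = sprintf("ID%04d", seq_along(original)),
    stringsAsFactors = FALSE
  )
}

# Replace each value with its token by looking it up in the key table.
mask <- function(values, key) {
  key$token[match(as.character(values), key$original)]
}

# Reverse lookup: recover the original value from a token.
unmask <- function(tokens, key) {
  key$original[match(tokens, key$token)]
}

customers <- c("John Snow", "John Snow", "Daffy Duck", "Joe Farmer")
key <- make_key(customers)
masked <- mask(customers, key)
restored <- unmask(masked, key)
```

Values that are not in the key come back as NA from mask, so a real version would need to append new values to the stored key before masking.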
library(digest)
test <- data.frame(CustomerName=c("John Snow","John Snow","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Joe Farmer","Joe Farmer","Joe Farmer","Joe Farmer"),
                   LoanNumber=c("12548","45878","45796","45813","45125","45216","45125","45778","45126","32548","45683"),
                   LoanBalance=c("458463","5412548","458463","5412548","458463","5412548","458463","5412548","458463","5412548","2484722"),
                   FarmType=c("Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy"))
# Hash every value in the first column (CustomerName) with SHA-1
test[,1] <- sapply(test[,1],digest,algo="sha1")
Example output:
CustomerName LoanNumber LoanBalance FarmType
1 5c96f777a14f201a6a9b79623d548f7ab61c7a11 12548 458463 Hay
2 5c96f777a14f201a6a9b79623d548f7ab61c7a11 45878 5412548 Dairy
3 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45796 458463 Fish
4 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45813 5412548 Hay
5 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45125 458463 Dairy
6 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45216 5412548 Fish
7 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45125 458463 Hay
8 b0db86a39b9617cef61a8986fd57af7960eec9f4 45778 5412548 Dairy
9 b0db86a39b9617cef61a8986fd57af7960eec9f4 45126 458463 Fish
10 b0db86a39b9617cef61a8986fd57af7960eec9f4 32548 5412548 Hay
11 b0db86a39b9617cef61a8986fd57af7960eec9f4 45683 2484722 Dairy
Modified data frame (removed the 'h' in John):
test <- data.frame(CustomerName=c("Jon Snow","Jon Snow","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Joe Farmer","Joe Farmer","Joe Farmer","Joe Farmer"),
                   LoanNumber=c("12548","45878","45796","45813","45125","45216","45125","45778","45126","32548","45683"),
                   LoanBalance=c("458463","5412548","458463","5412548","458463","5412548","458463","5412548","458463","5412548","2484722"),
                   FarmType=c("Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy"))
test[,1] <- sapply(test[,1],digest,algo="sha1")
New output:
CustomerName LoanNumber LoanBalance FarmType
1 2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f 12548 458463 Hay
2 2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f 45878 5412548 Dairy
3 b0187b6ff2322fa86004d4d22cd479f3cdc345d2 45796 458463 Fish
4 b0187b6ff2322fa86004d4d22cd479f3cdc345d2 45813 5412548 Hay
5 b0187b6ff2322fa86004d4d22cd479f3cdc345d2 45125 458463 Dairy
6 b0187b6ff2322fa86004d4d22cd479f3cdc345d2 45216 5412548 Fish
7 b0187b6ff2322fa86004d4d22cd479f3cdc345d2 45125 458463 Hay
8 2127453066c45db6ba7e2f6f8c14d22796c3fd54 45778 5412548 Dairy
9 2127453066c45db6ba7e2f6f8c14d22796c3fd54 45126 458463 Fish
10 2127453066c45db6ba7e2f6f8c14d22796c3fd54 32548 5412548 Hay
11 2127453066c45db6ba7e2f6f8c14d22796c3fd54 45683 2484722 Dairy
What I would have expected:
CustomerName LoanNumber LoanBalance FarmType
1 2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f 12548 458463 Hay
2 2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f 45878 5412548 Dairy
3 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45796 458463 Fish
4 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45813 5412548 Hay
5 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45125 458463 Dairy
6 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45216 5412548 Fish
7 10bf345ab114c20df2d1eedbbe7e7cd6b969db05 45125 458463 Hay
8 b0db86a39b9617cef61a8986fd57af7960eec9f4 45778 5412548 Dairy
9 b0db86a39b9617cef61a8986fd57af7960eec9f4 45126 458463 Fish
10 b0db86a39b9617cef61a8986fd57af7960eec9f4 32548 5412548 Hay
11 b0db86a39b9617cef61a8986fd57af7960eec9f4 45683 2484722 Dairy
Am I misunderstanding how this works? If I apply the same logic to multiple columns, the hashes for an unaltered column stay the same, but in the column with modified values every hash changes, including those for names I did not touch. I also tried wrapping digest in Vectorize() to rule out my sapply call as the cause, with the same results. Any ideas?
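One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: before R 4.0.0, data.frame() converted strings to factors by default, and sapply over a factor hands digest one-element factors that still carry every level. If that is what is happening here, changing one name changes the levels and therefore every hash in the column. The sketch below compares hashing the factor elements with hashing the plain strings. hash_column is a hypothetical helper, and serialize = FALSE asks digest to hash the string itself rather than the serialized R object.

```r
library(digest)

# Hypothetical helper: hash each value as a plain string.
hash_column <- function(x) {
  x <- as.character(x)  # drop factor levels so they cannot affect the hash
  vapply(
    x,
    function(v) digest(v, algo = "sha1", serialize = FALSE),
    character(1),
    USE.NAMES = FALSE
  )
}

before <- factor(c("John Snow", "Daffy Duck", "Joe Farmer"))
after  <- factor(c("Jon Snow",  "Daffy Duck", "Joe Farmer"))

# Hashing factor elements directly: each element is serialized with all levels.
factor_before <- sapply(before, digest, algo = "sha1")
factor_after  <- sapply(after,  digest, algo = "sha1")
print(factor_before[2] == factor_after[2])  # compare the "Daffy Duck" hashes

# Hashing the underlying strings: only the edited name should change.
string_before <- hash_column(before)
string_after  <- hash_column(after)
print(string_before == string_after)
```

If the factor comparison reports a mismatch for "Daffy Duck" while the string comparison does not, converting the column with as.character() (or building the data frame with stringsAsFactors = FALSE) before hashing would be the first thing to try.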