I'm trying to create a Shiny app that lets a user select columns to encrypt, where each value should always map to the same output across runs as long as the data is the same. For example, if customer name = "John" you always get "A" when running this process; if the customer name changes to "Jon" you could get "C", but if it is changed back to "John" you would get "A" again. This is going to be used to 'mask' sensitive data for analysis.

Additionally, if anyone could suggest a method to 'decrypt' these columns by storing a key to be used later, that would be appreciated.

A simplified version of how I'm attempting to accomplish this (the digest library is required):

library(digest)

test <- data.frame(CustomerName=c("John Snow","John Snow","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Joe Farmer","Joe Farmer","Joe Farmer","Joe Farmer"),
               LoanNumber=c("12548","45878","45796","45813","45125","45216","45125","45778","45126","32548","45683"),
               LoanBalance=c("458463","5412548","458463","5412548","458463","5412548","458463","5412548","458463","5412548","2484722"),
               FarmType=c("Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy"))

# Hash every value in the first column (CustomerName) with SHA-1
test[,1] <- sapply(test[,1],digest,algo="sha1")

Example output:

                                   CustomerName LoanNumber LoanBalance FarmType
1  5c96f777a14f201a6a9b79623d548f7ab61c7a11      12548      458463      Hay
2  5c96f777a14f201a6a9b79623d548f7ab61c7a11      45878     5412548    Dairy
3  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45796      458463     Fish
4  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45813     5412548      Hay
5  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463    Dairy
6  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45216     5412548     Fish
7  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463      Hay
8  b0db86a39b9617cef61a8986fd57af7960eec9f4      45778     5412548    Dairy
9  b0db86a39b9617cef61a8986fd57af7960eec9f4      45126      458463     Fish
10 b0db86a39b9617cef61a8986fd57af7960eec9f4      32548     5412548      Hay
11 b0db86a39b9617cef61a8986fd57af7960eec9f4      45683     2484722    Dairy

Modified data frame (removed the 'h' in John):

test <- data.frame(CustomerName=c("Jon Snow","Jon Snow","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Joe Farmer","Joe Farmer","Joe Farmer","Joe Farmer"),
               LoanNumber=c("12548","45878","45796","45813","45125","45216","45125","45778","45126","32548","45683"),
               LoanBalance=c("458463","5412548","458463","5412548","458463","5412548","458463","5412548","458463","5412548","2484722"),
               FarmType=c("Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy"))
test[,1] <- sapply(test[,1],digest,algo="sha1")

New output:

                                   CustomerName LoanNumber LoanBalance FarmType
1  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      12548      458463      Hay
2  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      45878     5412548    Dairy
3  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45796      458463     Fish
4  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45813     5412548      Hay
5  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45125      458463    Dairy
6  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45216     5412548     Fish
7  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45125      458463      Hay
8  2127453066c45db6ba7e2f6f8c14d22796c3fd54      45778     5412548    Dairy
9  2127453066c45db6ba7e2f6f8c14d22796c3fd54      45126      458463     Fish
10 2127453066c45db6ba7e2f6f8c14d22796c3fd54      32548     5412548      Hay
11 2127453066c45db6ba7e2f6f8c14d22796c3fd54      45683     2484722    Dairy

What I would have expected:

                                   CustomerName LoanNumber LoanBalance FarmType
1  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      12548      458463      Hay
2  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      45878     5412548    Dairy
3  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45796      458463     Fish
4  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45813     5412548      Hay
5  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463    Dairy
6  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45216     5412548     Fish
7  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463      Hay
8  b0db86a39b9617cef61a8986fd57af7960eec9f4      45778     5412548    Dairy
9  b0db86a39b9617cef61a8986fd57af7960eec9f4      45126      458463     Fish
10 b0db86a39b9617cef61a8986fd57af7960eec9f4      32548     5412548      Hay
11 b0db86a39b9617cef61a8986fd57af7960eec9f4      45683     2484722    Dairy

Am I misunderstanding how this works? If I apply the same logic to multiple columns, I get the same values for the unaltered columns, but the issue persists for the column with modified values. I also tried wrapping digest with Vectorize to make sure my sapply call wasn't the problem, and got the same results. Any ideas?

sc305495
  • 249
  • 3
  • 11

1 Answer

I think that I've answered my own question... of course right after I post it here :).

The digest function has a serialize parameter with the following documentation: "A logical variable indicating whether the object should be serialized using serialize (in ASCII form). Setting this to FALSE allows to compare the digest output of given character strings to known control output. It also allows the use of raw vectors such as the output of non-ASCII serialization."

Setting serialize to FALSE seems to resolve the problem and I get the expected output.

ex:

test[,1] <- sapply(test[,1],digest,algo="sha1",serialize = FALSE)
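
A likely reason the original hashes shifted, assuming an R version before 4.0 (where data.frame() converts strings to factors by default): sapply hands each element to digest as a one-element factor that still carries all of the column's levels, so with serialization the hash of "Daffy Duck" changes whenever any other name in the column changes. Below is a minimal sketch that hashes the plain character value instead, and also builds a lookup table that could be used to reverse the masking later. The names hash_column and key are hypothetical helpers for illustration, not part of digest, and the data frame is shortened.

```r
library(digest)

# Hypothetical helper: hash each value as a plain character string.
hash_column <- function(x, algo = "sha1") {
  # as.character() drops factor levels, so each hash depends only on
  # the text of that one value, not on the rest of the column.
  vapply(as.character(x), digest, character(1),
         algo = algo, serialize = FALSE, USE.NAMES = FALSE)
}

test <- data.frame(
  CustomerName = c("John Snow", "Daffy Duck", "Joe Farmer"),
  LoanNumber   = c("12548", "45796", "45778"),
  stringsAsFactors = TRUE  # mimic the default of R versions before 4.0
)

# Lookup table ("key"): one row per distinct original value.
key <- data.frame(
  original = unique(as.character(test$CustomerName)),
  stringsAsFactors = FALSE
)
key$masked <- hash_column(key$original)

# Mask the column.
test$CustomerName <- hash_column(test$CustomerName)

# Later: recover the original values by matching against the key.
restored <- key$original[match(test$CustomerName, key$masked)]
```

A hash cannot be decrypted as such, so reversing the masking relies entirely on the stored key table; it maps every masked value back to the original and would need to be kept somewhere safe.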