I want to encrypt the values in one column of my Pandas (or PySpark) dataframe, e.g. take the column mobno in the following dataframe, encrypt it, and put the result in the encrypted_value column:
I want to use an AWS KMS encryption key. My question is: what is the most elegant way to achieve this?
I am thinking about using a UDF that calls boto3's KMS client. Something like:
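from pyspark.sql.functions import udf

# kms_client (a boto3 KMS client) and aws_kms_key_id are assumed to be defined elsewhere
@udf
def encrypt(plaintext):
    response = kms_client.encrypt(
        KeyId=aws_kms_key_id,
        Plaintext=plaintext
    )
    ciphertext = response['CiphertextBlob']
    return ciphertext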
and then applying this UDF to the whole column.
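To make the idea concrete, here is a rough sketch of the per-value approach I have in mind, written so it runs without AWS credentials. FakeKmsClient is a hypothetical stand-in that only mimics the shape of the boto3 encrypt call (keyword arguments KeyId and Plaintext, a response dict containing 'CiphertextBlob'); it does not do real encryption. The helper names encrypt_value and encrypt_column, the key alias, and the sample numbers are all made up for illustration, and base64-encoding the result is just my assumption about how to store the bytes in a string column.

```python
import base64


class FakeKmsClient:
    """Hypothetical stand-in for a boto3 KMS client. NOT real encryption."""

    def encrypt(self, KeyId, Plaintext):
        # Mimic the response shape: a dict holding the ciphertext as bytes.
        scrambled = bytes(b ^ 0x5A for b in Plaintext)
        return {"CiphertextBlob": scrambled, "KeyId": KeyId}


def encrypt_value(client, key_id, value):
    """Encrypt one cell; return base64 text so it fits in a string column."""
    response = client.encrypt(KeyId=key_id, Plaintext=str(value).encode("utf-8"))
    return base64.b64encode(response["CiphertextBlob"]).decode("ascii")


def encrypt_column(client, key_id, values):
    """Apply encrypt_value to every value; this means one client call per row."""
    return [encrypt_value(client, key_id, v) for v in values]


# Made-up sample data standing in for the mobno column.
mobno = ["9876543210", "9123456780"]
encrypted_value = encrypt_column(FakeKmsClient(), "alias/my-key", mobno)
```

With a real client I imagine the Pandas version would be something like df["encrypted_value"] = df["mobno"].map(lambda v: encrypt_value(kms_client, aws_kms_key_id, v)), which would make one KMS request per row.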
But I am not confident this is the right way, because I am an encryption rookie. First, I don't even know whether the kms_client.encrypt function is meant for encrypting values (such as those in the column) or for manipulating keys. Maybe the better way is to obtain the key and then use some Python encryption library (such as hashlib).
I would like some clarification on the encryption process, and also a recommendation on the best approach to column encryption.