4

I need a pipeline that

  • ingests sensitive data from an API
  • de-identify/encrypt specific fields based on certain conditions
  • upload data post encryption (newline delimited JSON format) into BigQuery table

In addition to the above, I also need to be able to re-identify/decrypt data within BigQuery (eg UDFs, AEAD functions).

The issue right now is that I cannot figure out how I can encrypt this data in Python in a way that can be re-identified/decrypted in BigQuery.

So far I've seen many examples of pipelines encrypting data using Dataflow/DLP/Cloud KMS or Python libraries (eg Fernet). These same examples also show how the pipeline can decrypt data as well. They don't, however, provide a way to decrypt the data directly in BQ.

I've also seen how you can encrypt/decrypt data using BQ AEAD functions. I have not yet figured out how I can encrypt data in Python so that it can be decrypted in BQ.

I have thought about doing the encryption process in BQ instead of Airflow/Python via staging tables, but it is complicated because of the amount of nested fields that would have to be encrypted.

The encryption part is easier to do in Python/Airflow. The decryption step is easy in BigQuery.

How can I use an encryption method in Python that can be decrypted in BigQuery?

Korean_Of_the_Mountain
  • 1,428
  • 3
  • 16
  • 40

1 Answers1

1

I have the same question as you and have made some progress thanks to this issue.

To Encrypt in Python, run this code:

import io
import base64
import json

import tink
from tink import aead, cleartext_keyset_handle
# sourced from https://github.com/google/tink/issues/373

aead.register()

# Create key
keyset_handle = tink.new_keyset_handle(aead.aead_key_templates.AES256_GCM)
aead_primitive = keyset_handle.primitive(aead.Aead)

# encrypt your text
plaintext = b'hello'
associated_data = b'context'
ciphertext = aead_primitive.encrypt(plaintext, associated_data)
print(ciphertext)

# print the key
out = io.BytesIO()
writer = tink.BinaryKeysetWriter(out)
cleartext_keyset_handle.write(writer, keyset_handle)
out.seek(0)
print(base64.b64encode(out.read()))

The output is:

b'\x01T0\xedK\x06\xb9;.j\xec^f\xd3\x01\xf4)\xeb\n\x86\r>E\xd5\x95\xc3\x8f.\x8c\x8bL(O"\xdc0]!'
b'CMvaw6EFEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaICn4WTH+lR6jWA5kV7k3WEAaPXUg1B0y6KKGBPCWD8RuGAEQARjL2sOhBSAB'

Now in BigQuery, you can decrypt via using the AEAD.DECRYPT_STRING and FROM_BASE64() functions as follows (for simplicity I pasted the keys, ciphertext, and additional data obtained in Python):

select AEAD.DECRYPT_STRING(
    FROM_BASE64('CMvaw6EFEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaICn4WTH+lR6jWA5kV7k3WEAaPXUg1B0y6KKGBPCWD8RuGAEQARjL2sOhBSAB'),
      b'\x01T0\xedK\x06\xb9;.j\xec^f\xd3\x01\xf4)\xeb\n\x86\r>E\xd5\x95\xc3\x8f.\x8c\x8bL(O"\xdc0]!',
      'context' 
    )

the result is the decrypted text:

enter image description here

Emilie
  • 71
  • 4
  • This looks great! Unfortunately it might not work for me per se because i have the added constraint of trying to make this work in Composer (Airflow version 1.10.15. And I don't think I'm able to install `tink` in my environment. – Korean_Of_the_Mountain Sep 01 '21 at 15:44