I need a pipeline that
- ingests sensitive data from an API,
- de-identifies/encrypts specific fields based on certain conditions, and
- uploads the encrypted data, as newline-delimited JSON, into a BigQuery table.
In addition to the above, I also need to be able to re-identify/decrypt the data within BigQuery (e.g. with UDFs or the AEAD functions).
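To make the data shape concrete, the transformation step I have in mind looks roughly like this. It's only a minimal sketch: the field name and condition are examples, `encrypt` is a stand-in for whatever encryption turns out to be BigQuery-compatible, and I'm assuming the target column is BYTES, which (as far as I know) means the ciphertext has to be base64-encoded in the newline-delimited JSON.

```python
import base64
import json

def de_identify(record: dict, encrypt) -> dict:
    """Encrypt selected fields of one record before writing it out as NDJSON.

    `encrypt` is a bytes-in/bytes-out callable standing in for the real,
    BigQuery-compatible encryption I'm asking about.
    """
    out = dict(record)
    # Example condition and field name only; the real rules are more involved.
    if out.get("country") == "EU":
        ciphertext = encrypt(out["email"].encode("utf-8"))
        # BigQuery expects BYTES values in JSON loads as base64-encoded strings.
        out["email"] = base64.b64encode(ciphertext).decode("ascii")
    return out

records = [{"id": 1, "email": "a@example.com", "country": "EU"}]
# The identity lambda is just a placeholder so the sketch runs end to end.
ndjson = "\n".join(json.dumps(de_identify(r, encrypt=lambda b: b)) for r in records)
```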
The issue right now is that I cannot figure out how to encrypt this data in Python in a way that can be re-identified/decrypted in BigQuery.
So far I've seen many examples of pipelines encrypting data using Dataflow/DLP/Cloud KMS or Python libraries (e.g. Fernet). These examples also show how the pipeline can decrypt the data, but they don't provide a way to decrypt it directly in BigQuery.
I've also seen how you can encrypt/decrypt data using the BigQuery AEAD functions, but I have not yet figured out how to encrypt data in Python so that those functions can decrypt it.
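From what I can tell, BigQuery's AEAD functions are built on Tink, so my current idea is to do the Python-side encryption with the Tink library, using the same serialized keyset that BigQuery's AEAD functions take (e.g. one created by `KEYS.NEW_KEYSET('AEAD_AES_GCM_256')`). A rough sketch of what I mean, with the keyset handled in cleartext purely for illustration (in reality it would be KMS-wrapped), and noting that the exact keyset I/O helpers vary a bit between tink-py versions:

```python
import io

import tink
from tink import aead, cleartext_keyset_handle

aead.register()

# Option 1: create an AES256-GCM keyset in Python (the same key type that
# BigQuery's KEYS.NEW_KEYSET('AEAD_AES_GCM_256') produces) and serialize it
# so the same bytes can be passed to BigQuery's AEAD functions later.
keyset_handle = tink.new_keyset_handle(aead.aead_key_templates.AES256_GCM)
buf = io.BytesIO()
cleartext_keyset_handle.write(tink.BinaryKeysetWriter(buf), keyset_handle)
keyset_bytes = buf.getvalue()

# Option 2: load a keyset that BigQuery generated with KEYS.NEW_KEYSET.
# keyset_handle = cleartext_keyset_handle.read(tink.BinaryKeysetReader(keyset_bytes))

# Encrypt one field; the additional data has to match whatever the BigQuery
# query later passes to AEAD.DECRYPT_STRING / AEAD.DECRYPT_BYTES.
primitive = keyset_handle.primitive(aead.Aead)
ciphertext = primitive.encrypt(b"sensitive value", b"additional data")
```

As far as I understand, a keyset serialized this way is what the AEAD functions expect as their keyset argument, and a keyset produced by `KEYS.NEW_KEYSET` can be loaded back into Tink the same way, but I haven't been able to confirm this end to end.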
I have thought about doing the encryption in BigQuery instead of Airflow/Python, via staging tables, but that gets complicated because of the number of nested fields that would have to be encrypted.
The encryption step is easier to do in Python/Airflow, and the decryption step is easy in BigQuery.
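For reference, the decryption I have in mind on the BigQuery side is just a query along these lines, run from Python with the BigQuery client. This is a sketch only: the project/dataset/table, column names, and file path are placeholders, and in practice I'd keep the keyset KMS-wrapped instead of passing it around in cleartext.

```python
from google.cloud import bigquery

# The same serialized Tink keyset used for encryption; the path is a placeholder.
with open("keyset.bin", "rb") as f:
    keyset_bytes = f.read()

client = bigquery.Client()

# The additional data string passed to AEAD.DECRYPT_STRING has to match what
# was used at encryption time.
sql = """
SELECT
  id,
  AEAD.DECRYPT_STRING(@keyset, email_encrypted, 'additional data') AS email
FROM `my_project.my_dataset.my_table`
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("keyset", "BYTES", keyset_bytes),
    ]
)

for row in client.query(sql, job_config=job_config).result():
    print(row.id, row.email)
```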
How can I encrypt data in Python using a method whose output can be decrypted in BigQuery?