
I went through this Google Cloud documentation, which mentions that:

Dataflow can access sources and sinks that are protected by Cloud KMS keys without you having to specify the Cloud KMS key of those sources and sinks, as long as you are not creating new objects.

I have a few questions regarding this:

Q.1. Does this mean we don't need to decrypt the encrypted source file within our Beam code? Does Dataflow have this functionality built in?

Q.2. If the source file is encrypted, will the output file from Dataflow be encrypted by default with the same key (say we have a symmetric key)?

Q.3. What are the objects being referred to here?

PS: I want to read an encrypted AVRO file from a GCS bucket, apply my Apache Beam transforms, and write an encrypted file back to the bucket.

Krish

1 Answer


Cloud Dataflow is a fully managed service; if you do not specify an encryption key, data is encrypted with Google-managed keys by default, and you can instead supply your own Cloud KMS key. Cloud KMS is a cloud-hosted key management service that can manage both symmetric and asymmetric cryptographic keys.

  • When Cloud KMS is used with Cloud Dataflow, it allows you to encrypt the data processed in the Dataflow pipeline. With Cloud KMS, data held in temporary storage such as Persistent Disk can also be encrypted, giving end-to-end protection of the data. You do not need to decrypt the source file within your Beam code: the data from the sources is encrypted at rest, and decryption is handled automatically by Dataflow.

  • If you are using a symmetric key, a single key managed by Cloud KMS is used for both encryption and decryption of the data. If you are using an asymmetric key, a public key encrypts the data and a private key decrypts it. You need to grant the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataflow service account before it can encrypt or decrypt. Cloud KMS automatically determines the key for decryption from the provided ciphertext, so no extra handling is needed on your side.

  • The objects referred to here are the things Cloud KMS can encrypt: tables in BigQuery, files in Cloud Storage, and other data in the sources and sinks.
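As a minimal sketch of the pipeline from the question's PS (read AVRO from GCS, transform, write AVRO back), assuming the Python SDK: `dataflow_kms_key` is the pipeline option that names the CMEK key; every bucket, project, and key resource name below is a placeholder you would substitute with your own.

```python
def add_label(record):
    # Placeholder per-record transform; replace with your real Beam logic.
    return {**record, "processed": True}


def run(input_path, output_prefix, schema, project, temp_location, kms_key):
    # Imported here so add_label stays usable without the Beam SDK installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # dataflow_kms_key tells Dataflow which CMEK key to use for pipeline
    # state and for new objects it writes. Reading CMEK-protected input
    # needs no extra code: Cloud Storage decrypts transparently for
    # callers with the right IAM permissions.
    options = PipelineOptions(
        runner="DataflowRunner",
        project=project,
        temp_location=temp_location,
        dataflow_kms_key=kms_key,
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromAvro(input_path)
            | "Transform" >> beam.Map(add_label)
            | "Write" >> beam.io.WriteToAvro(
                output_prefix, schema=schema, file_name_suffix=".avro"
            )
        )


# Example invocation (placeholders throughout):
# run("gs://my-bucket/input.avro", "gs://my-bucket/output", my_avro_schema,
#     "my-project", "gs://my-bucket/temp",
#     "projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key")
```

Note that this covers storage-layer (CMEK) encryption only; the records handed to `add_label` are already plaintext.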

For more information, you can check this blog.

Shipra Sarkar
  • To be clear, let's say I have "field_name":"encrypted_value" in my source file, and I want to do some transformation on the decrypted values. What should my approach be: 1) decrypt it to "decrypted_value" and then do my transformation, or 2) directly apply the transformation, since Dataflow makes sure it gets applied to the real value and not the encrypted string? 1 or 2 – Krish Jun 04 '22 at 06:49
  • E.g. My original file is "animal":"cat", but since it is in encrypted form it is "animal":b'encrypted_bytes'. Now I want to transform the files to "animal":"orange cat" and get the encrypted file for the same in sink. I don't want it to be something like "animal":"orange"+b'encrypted_bytes'. Then the transformation is working on wrong data. – Krish Jun 04 '22 at 07:17
  • You can perform any type of [transformation](https://cloud.google.com/dataflow/docs/concepts/beam-programming-model#concepts) on your data in dataflow. If you want to perform transformation on decrypted values then you have to design the pipeline in such a way that the transformation is applied on decrypted values. – Shipra Sarkar Jun 10 '22 at 13:17
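The distinction in the comments (field-level encryption is not undone by Dataflow, so the pipeline must decrypt before transforming) can be sketched as follows. Base64 stands in here for a real encrypt/decrypt round trip; with Cloud KMS you would instead call the KMS client's `encrypt`/`decrypt` methods on the field value. The point is only the ordering of steps, using the "cat" → "orange cat" example from the comment.

```python
import base64


def encrypt(value: str) -> bytes:
    # Stand-in for real field-level encryption (e.g. a Cloud KMS encrypt call).
    return base64.b64encode(value.encode("utf-8"))


def decrypt(ciphertext: bytes) -> str:
    # Stand-in for the matching decrypt call.
    return base64.b64decode(ciphertext).decode("utf-8")


def transform(value: str) -> str:
    # The business logic must see the plaintext.
    return "orange " + value


# Source record with a field-level encrypted value, as in the comment.
record = {"animal": encrypt("cat")}

# Approach 1 (correct): decrypt, transform the plaintext, re-encrypt for the sink.
plaintext = decrypt(record["animal"])
result = {"animal": encrypt(transform(plaintext))}
assert decrypt(result["animal"]) == "orange cat"

# Approach 2 (wrong): transforming the ciphertext directly operates on
# encrypted bytes, not on the real value.
wrong = "orange " + record["animal"].decode("utf-8")
assert wrong != "orange cat"
```

In a Beam pipeline the decrypt/transform/re-encrypt sequence would live inside a `DoFn` or a chain of `Map` steps; storage-layer CMEK encryption on the bucket is separate and needs no such handling.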