
Technical background: I am reading table data from Kafka and writing it into Hudi and Hive tables using Spark on AWS EMR. I want to encrypt data in transit within the cluster, as well as the synced external table data stored in S3 (data at rest).

Note: I don't want to use AWS EMR encryption; I want to use Spark or Hudi encryption. I don't want to be tied to AWS only; I want a platform-independent solution.

I read about Hudi/Spark encryption (link), but that is columnar encryption. I don't want to encrypt specific columns; I want all data to be encrypted. So is there any Spark configuration to encrypt the whole data at rest as well as in transit within the cluster?

TIA

Roobal Jindal

1 Answer


Parquet Modular Encryption is the only client-side encryption method supported by Parquet. You can use it to dynamically encrypt all the columns with the same key by getting the column list and adding all of the columns to the encryption config:

// Use the properties-driven crypto factory with the in-memory (mock) KMS client
jsc.hadoopConfiguration().set("parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
jsc.hadoopConfiguration().set("parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
// Register the master key (format: keyId:<base64-encoded key>)
jsc.hadoopConfiguration().set("parquet.encryption.key.list", "my_key:<some key>")
// Encrypt every column of the DataFrame with the same key, and also use it as the footer key
jsc.hadoopConfiguration().set("parquet.encryption.column.keys", "my_key:%s".format(df.columns.mkString(",")))
jsc.hadoopConfiguration().set("parquet.encryption.footer.key", "my_key")
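
To see where this fits in the write path, here is a minimal sketch (not part of the original answer), assuming df is the DataFrame built from the Kafka data and that Hudi's Parquet writer picks up the Hadoop properties set above; the table name, key fields and target path are placeholders:

// Illustrative Hudi write; the encryption comes only from the Hadoop properties set above.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                     // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")     // placeholder record key field
  .option("hoodie.datasource.write.precombine.field", "ts")    // placeholder precombine field
  .option("hoodie.datasource.write.partitionpath.field", "dt") // placeholder partition field
  .mode("append")
  .save("s3a://my-bucket/hudi/my_table")                       // placeholder target path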

If you want to encrypt the whole files, the best solution is server-side encryption, but then you need to configure each storage service separately (S3, GCS, HDFS, ...). Try to avoid that approach if you really do use multiple storage services.
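
For example, with the s3a connector this can be done through Hadoop properties (a sketch assuming the hadoop-aws/s3a client; EMRFS, GCS and HDFS each have their own equivalent settings, and the KMS key value is a placeholder):

// S3 server-side encryption via s3a (hadoop-aws) properties; values are placeholders.
jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", "<kms key arn>")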

Hussein Awala