I am saving my Spark dataset as a Parquet file on my local machine. I would like to know if there are any ways I could encrypt the data using some encryption algorithm. The code I am using to save my data as a Parquet file looks something like this:

dataset.write().mode("overwrite").parquet(parquetFile);

I saw a similar question, but my query is different as I am writing to my local disk.

Somesh Dhal

2 Answers


Since Spark 3.2, columnar encryption is supported for Parquet tables.

For example:

hadoopConfiguration.set("parquet.encryption.kms.client.class" ,
   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");

// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
hadoopConfiguration.set("parquet.encryption.key.list" ,
   "keyA:AAECAwQFBgcICQoLDA0ODw== ,  keyB:AAECAAECAAECAAECAAECAA==");

// Activate Parquet encryption, driven by Hadoop properties
hadoopConfiguration.set("parquet.crypto.factory.class" ,
   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

// Write encrypted dataframe files. 
// Column "square" will be protected with master key "keyA".
// Parquet file footers will be protected with master key "keyB"
squaresDF.write().
   option("parquet.encryption.column.keys" , "keyA:square").
   option("parquet.encryption.footer.key" , "keyB").
   parquet("/path/to/table.parquet.encrypted");

// Read encrypted dataframe files
Dataset<Row> df2 = spark.read().parquet("/path/to/table.parquet.encrypted");

This is based on the usage example in: https://spark.apache.org/docs/3.2.0/sql-data-sources-parquet.html#columnar-encryption
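
Note that decryption is driven by the same Hadoop properties: if the encrypted files are read from a separate Spark application, the crypto factory and KMS client have to be configured there as well (and, for the mock InMemoryKMS only, the explicit key list). A minimal sketch of the read side under that assumption, reusing the keys and path from the example above:

// Same properties as on the write side. The mock InMemoryKMS and its
// explicit key list are for testing only - in production,
// "parquet.encryption.kms.client.class" should point to a client for a
// real KMS server (an implementation of the parquet-mr KmsClient interface).
Configuration conf = spark.sparkContext().hadoopConfiguration();
conf.set("parquet.crypto.factory.class",
   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
conf.set("parquet.encryption.kms.client.class",
   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
conf.set("parquet.encryption.key.list",
   "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==");

// Columns encrypted with keyA and footers protected with keyB are
// decrypted transparently during the read.
Dataset<Row> decrypted = spark.read().parquet("/path/to/table.parquet.encrypted");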

Mayaa

I don't think you can do it in Spark directly; however, there are other projects you can use around Parquet, in particular Apache Arrow. I think this video explains how to do it:

https://databricks.com/session_na21/data-security-at-scale-through-spark-and-parquet-encryption

UPDATE: since Spark 3.2.0, this seems to be possible.

Ricardo Piccoli