It is possible to create a new codec that compresses the file(s) first and then encrypts them. The idea is to wrap the output stream handed to the codec in a CipherOutputStream, so that the compressed bytes are encrypted before they are written to the file system.
import java.io.{IOException, OutputStream}
import java.nio.charset.StandardCharsets
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.SecretKeySpec
import org.apache.hadoop.io.compress._

class GzipEncryptionCodec extends GzipCodec {

  override def getDefaultExtension(): String = ".gz.enc"

  @throws[IOException]
  override def createOutputStream(out: OutputStream): CompressionOutputStream =
    super.createOutputStream(wrapWithCipherStream(out))

  @throws[IOException]
  override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream =
    super.createOutputStream(wrapWithCipherStream(out), compressor)

  // Encrypts everything the gzip stream writes before it reaches `out`.
  def wrapWithCipherStream(out: OutputStream): OutputStream = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding") // or another algorithm
    val secretKey = new SecretKeySpec(
      "hello world 1234".getBytes(StandardCharsets.UTF_8), // this is not a secure password!
      "AES")
    cipher.init(Cipher.ENCRYPT_MODE, secretKey)
    new CipherOutputStream(out, cipher)
  }
}
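The stream layering that the codec relies on can be tried without Spark or Hadoop. The following is a minimal sketch using only JDK classes; the object name `StreamLayering` and its two methods are made up for illustration and are not part of the codec above. It assumes the same demo key and cipher settings as the codec.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.SecretKeySpec

// Hypothetical helper, for illustration only.
object StreamLayering {
  // Demo key (16 bytes = AES-128); the same insecure placeholder as in the codec.
  private val key = new SecretKeySpec(
    "hello world 1234".getBytes(StandardCharsets.UTF_8), "AES")

  private def newCipher(mode: Int): Cipher = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    cipher.init(mode, key)
    cipher
  }

  // Bytes pass through gzip first, then through the cipher, then reach the sink.
  def compressThenEncrypt(plain: Array[Byte]): Array[Byte] = {
    val sink = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(
      new CipherOutputStream(sink, newCipher(Cipher.ENCRYPT_MODE)))
    // close() finishes the gzip stream and lets the cipher write its final padded block.
    try gz.write(plain) finally gz.close()
    sink.toByteArray
  }

  // Reading undoes the layers in reverse order: decrypt first, then gunzip.
  def decryptThenDecompress(data: Array[Byte]): Array[Byte] = {
    val gz = new GZIPInputStream(
      new CipherInputStream(new ByteArrayInputStream(data), newCipher(Cipher.DECRYPT_MODE)))
    try {
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](4096)
      var n = gz.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        n = gz.read(buf)
      }
      out.toByteArray
    } finally gz.close()
  }
}
```

Passing a byte array through `compressThenEncrypt` and then `decryptThenDecompress` should return the original bytes. Closing the outermost stream matters: without it the last cipher block is never written and decryption fails.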
When writing the CSV file, this codec can be used:
import org.apache.spark.sql.SaveMode

df.write
  .option("codec", "GzipEncryptionCodec")
  .mode(SaveMode.Overwrite)
  .csv("encryped_csv")
The output files will be encrypted and get the suffix .gz.enc.
This codec only encrypts the data and cannot decrypt it. Some background on why changing the codec for reading is more difficult than for writing can be found here.
Instead, the files can be read and decrypted with a simple Scala program:
import java.io.{File, FileInputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import javax.crypto.{Cipher, CipherInputStream}
import javax.crypto.spec.SecretKeySpec

val secretKey = new SecretKeySpec("hello world 1234".getBytes(StandardCharsets.UTF_8), "AES")

val files = new File("encryped_csv").listFiles.filter(_.getName.endsWith(".gz.enc")).toList

files.foreach { f =>
  // A fresh cipher per file keeps the decryption of each file independent.
  val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
  cipher.init(Cipher.DECRYPT_MODE, secretKey)
  // Undo the layers in reverse order: decrypt first, then gunzip.
  val gz = new GZIPInputStream(new CipherInputStream(new FileInputStream(f), cipher))
  val result = try scala.io.Source.fromInputStream(gz, "UTF-8").mkString finally gz.close()
  println(f.getName)
  println(result)
}