
Using the code below, I'm able to compress the data and save it as a .gz file:

import spark.implicits._
 

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("example.csv.gz")

Does Spark provide an option to compress data with password protection? I wasn't able to find anything in the Spark documentation.

Tulasi

2 Answers


It is possible to create a new codec that compresses the file(s) first and then encrypts them. The idea is to wrap the codec's output streams with a CipherOutputStream before writing to the file system.

import java.io.{IOException, OutputStream}
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.SecretKeySpec
import org.apache.hadoop.io.compress._

class GzipEncryptionCodec extends GzipCodec {

  // Distinct extension so encrypted files are easy to recognize
  override def getDefaultExtension(): String = ".gz.enc"

  @throws[IOException]
  override def createOutputStream(out: OutputStream): CompressionOutputStream =
    super.createOutputStream(wrapWithCipherStream(out))

  @throws[IOException]
  override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream =
    super.createOutputStream(wrapWithCipherStream(out), compressor)

  // Wraps the raw output stream so that the gzipped bytes are encrypted
  // before they reach the file system.
  def wrapWithCipherStream(out: OutputStream): OutputStream = {
    // ECB mode is used for simplicity only; prefer an authenticated
    // mode such as AES/GCM for real workloads
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
    val secretKey = new SecretKeySpec(
      "hello world 1234".getBytes, // this is not a secure password!
      "AES")
    cipher.init(Cipher.ENCRYPT_MODE, secretKey)
    new CipherOutputStream(out, cipher)
  }
}

When writing the CSV file, this codec can be used:

import org.apache.spark.sql.SaveMode

df.write
  .option("codec", "GzipEncryptionCodec")
  .mode(SaveMode.Overwrite)
  .csv("encrypted_csv")

and the output files will be encrypted and get the suffix .gz.enc.

This codec only encrypts the data and cannot decrypt it. Some background on why changing the codec for reading is more difficult than for writing can be found here.

Instead, the files can be read and decrypted with a simple Scala program:

import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream
import javax.crypto.{Cipher, CipherInputStream}
import javax.crypto.spec.SecretKeySpec

// Use the same algorithm and key as the codec above
val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
val secretKey = new SecretKeySpec("hello world 1234".getBytes(), "AES")
cipher.init(Cipher.DECRYPT_MODE, secretKey)

val files = new File("encrypted_csv").listFiles.filter(_.getName().endsWith(".gz.enc")).toList

files.foreach(f => {
  // First decrypt, then gunzip - the reverse of the write path
  val dec = new CipherInputStream(new FileInputStream(f), cipher)
  val gz = new GZIPInputStream(dec)
  val result = scala.io.Source.fromInputStream(gz).mkString
  println(f.getName)
  println(result)
})
werner

Gzip itself doesn't support password protection. On Unix, you need to use other tools to encrypt the file with a password.
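As a rough sketch of that idea in plain Scala (not a Spark feature): derive an AES key from a password with PBKDF2 and encrypt the already-written .gz file as a post-processing step. The file names, password, and salt handling here are placeholder assumptions.

import java.io.{FileInputStream, FileOutputStream}
import java.security.SecureRandom
import javax.crypto.{Cipher, CipherOutputStream, SecretKeyFactory}
import javax.crypto.spec.{PBEKeySpec, SecretKeySpec}

// Derive an AES-128 key from a password (placeholder password and salt;
// the salt must be stored somewhere to decrypt later)
val salt = new Array[Byte](16)
new SecureRandom().nextBytes(salt)
val keySpec = new PBEKeySpec("my password".toCharArray, salt, 65536, 128)
val derived = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(keySpec)
val aesKey = new SecretKeySpec(derived.getEncoded, "AES")

// Encrypt the gzipped part file written by Spark (placeholder file name);
// same caveat as above: ECB is for illustration only
val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
cipher.init(Cipher.ENCRYPT_MODE, aesKey)

val in = new FileInputStream("part-00000.csv.gz")
val out = new CipherOutputStream(new FileOutputStream("part-00000.csv.gz.enc"), cipher)
val buf = new Array[Byte](8192)
Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => out.write(buf, 0, n))
in.close(); out.close()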

P.S. Also, replace com.databricks.spark.csv with just csv - Spark has supported CSV natively for a long time now. And remove the corresponding Maven dependency.
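For instance, the original write using the built-in CSV source would look like this (the output path here is just a placeholder; note that Spark writes a directory of part files):

someDF.coalesce(1)
  .write
  .option("header", "true")
  .option("compression", "gzip")
  .csv("example_csv")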

Alex Ott
  • Does any other Spark codec/lib provide the capability to compress the CSV with password protection? – Tulasi Sep 11 '20 at 06:58
  • This lib (https://github.com/srikanth-lingala/zip4j) may help, I guess. Either that, or I have to save all the CSV files locally and then create a zip with password protection (see the sketch after these comments). But I'm not sure whether this will work in cluster mode or not. – Tulasi Sep 11 '20 at 07:01
  • I'm not aware of anything ready to use for a CSV + encryption combination in Spark. I think you may need to resort to low-level stuff: for example, foreachRDD, generate the CSV "manually", and write the files yourself. But that would be challenging if you aren't very deep into Spark. – Alex Ott Sep 11 '20 at 07:20
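A minimal sketch of the zip4j idea mentioned in the comments, using the zip4j 2.x API to zip an already-written part file with a password (the archive and file names are placeholders):

import java.io.File
import net.lingala.zip4j.ZipFile
import net.lingala.zip4j.model.ZipParameters
import net.lingala.zip4j.model.enums.EncryptionMethod

// Configure AES encryption for the zip entries
val params = new ZipParameters()
params.setEncryptFiles(true)
params.setEncryptionMethod(EncryptionMethod.AES)

// Create a password-protected archive from a locally saved CSV part file
val zip = new ZipFile("data.zip", "my password".toCharArray)
zip.addFile(new File("part-00000.csv"), params)

As the comment notes, this runs on a single machine, so in cluster mode the part files would first have to be collected to one place (e.g. the driver) before zipping.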