
I have a streaming DataFrame reading messages from Kafka. Before I initiate writeStream, I want to hash some columns and mask a few others. The columns to be hashed or masked differ from table to table, so I am making them parameterized.

I am using the below code to mask the selected columns, which works fine.

var maskColumnsconfig = "COL1, COL2" //Reading columns to mask from Config file or widget
var maskColumns = maskColumnsconfig.split(",") 
def maskData(base: DataFrame, maskColumns: Seq[String]) = {
    val maskExpr = base.columns.map { col => if(maskColumns.contains(col)) s"null as ${col}" else col }
    base.selectExpr(maskExpr: _*) //masking columns as null
}
val maskedDF = maskData(myDataFrame,Seq(maskColumns:_*))
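One caveat with the config parsing above: splitting `"COL1, COL2"` on a bare comma leaves a leading space on the second name, so `maskColumns.contains(col)` can silently miss that column. A minimal sketch of a defensive parse (pure Scala, names taken from the snippet above):

```scala
// Sketch: trim whitespace when parsing the comma-separated config value.
// Without trim, "COL1, COL2".split(",") yields Array("COL1", " COL2"),
// and contains("COL2") on that result would be false.
val maskColumnsconfig = "COL1, COL2"
val maskColumns: Seq[String] = maskColumnsconfig.split(",").map(_.trim).toSeq
```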

Reference: - How to mask columns using Spark 2?

For hashing, I am looking to create a function that does something similar to below: -

myDataFrame.withColumn("COL1_hashed", sha2($"COL1", 256))
  .drop("COL1").withColumnRenamed("COL1_hashed", "COL1")
  .withColumn("COL2_hashed", sha2($"COL2", 256))
  .drop("COL2").withColumnRenamed("COL2_hashed", "COL2")

Edit: Instead, I can just do: -

myDataFrame.withColumn("COL1", sha2($"COL1", 256)).withColumn("COL2", sha2($"COL2", 256))

i.e.

1. add the hashed column
2. drop the original column
3. rename the hashed column to the name of the original column
4. repeat for the other columns to be hashed

EDIT: 1. replace the column with hashed values, 2. repeat for the other columns to be hashed
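The two EDIT steps (replace a column with its hashed values, repeat per column) can be collapsed into one helper by folding `withColumn` over the column list. A minimal sketch, assuming a live SparkSession, string-typed columns (Spark's `sha2` needs string or binary input), and `org.apache.spark.sql.functions.sha2`; the helper name `hashColumns` is mine:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sha2

// Sketch: apply withColumn once per listed column via foldLeft,
// replacing each column's values with their SHA-256 hash in place.
def hashColumns(base: DataFrame, hashCols: Seq[String]): DataFrame =
  hashCols.foldLeft(base) { (df, c) => df.withColumn(c, sha2(df(c), 256)) }

// Usage (untested): val hashedDF = hashColumns(myDataFrame, Seq("COL1", "COL2"))
```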

Any suggestions/ideas on how this can be achieved with a function that takes in multiple columns and performs the above operations on all of them? I tried creating a function like the one below, but it gives an error:

def hashData(base: DataFrame, hashColumns: Seq[String]) = {
    val hashExpr = base.columns.map { col => if(hashColumns.contains(col)) base.withColumn({col},sha2({col},256)) else col }
    base.selectExpr(hashExpr: _*)
}
command-3855877266331823:2: error: type mismatch;
 found   : String
 required: org.apache.spark.sql.Column
    val hashExpr = base.columns.map { col => if(hashColumns.contains(col)) base.withColumn({col},sha2({col},256)) else col }

EDIT 2: I tried to imitate the masking function, but it too gives an error.

def hashData(base: DataFrame, hashColumns: Seq[String]) = {
    val hashExpr = base.columns.map { col => if(hashColumns.contains(col)) base.withColumn(col,sha2(base(col),256)) else col }
    base.selectExpr(hashExpr: _*)
}

Error: -

 found   : Array[java.io.Serializable]
 required: Array[_ <: String]
Note: java.io.Serializable >: String, but class Array is invariant in type T.
You may wish to investigate a wildcard type such as `_ >: String`. (SLS 3.2.10)
    base.selectExpr(hashExpr: _*)
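The type-mismatch errors in both attempts come from mixing return types inside the `map`: the `if` branch returns a `DataFrame` (or `Column`), the `else` branch a `String`, so Scala infers `Array[java.io.Serializable]`, which `selectExpr` rejects. Keeping every branch a SQL expression string, exactly as the working mask function does, sidesteps this and also allows one function to do both hashing and masking. A minimal sketch (the function name `transformExprs` is mine, not from the original code):

```scala
// Sketch: build SQL expression strings only, mirroring the mask function.
// selectExpr takes strings, so every branch of the map must return a String;
// mixing a DataFrame/Column result with Strings is what caused the
// Array[java.io.Serializable] inference error above.
def transformExprs(allColumns: Seq[String],
                   hashColumns: Seq[String],
                   maskColumns: Seq[String]): Seq[String] =
  allColumns.map { c =>
    if (hashColumns.contains(c)) s"sha2($c, 256) AS $c"  // hash in place
    else if (maskColumns.contains(c)) s"null AS $c"      // mask as null
    else c                                               // pass through unchanged
  }

// With a DataFrame in scope it would be applied as (untested):
//   base.selectExpr(transformExprs(base.columns, hashCols, maskCols): _*)
```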

Ideally, I'd want one function that does both hashing and masking. I'd appreciate any ideas/leads on achieving this.

Scala: 2.11 Spark: 2.4.4

Swapandeep Singh
  • Hi, have you found a solution? I am facing the same scenario and the same issue from my side – venkat Ramanan VTR Mar 22 '21 at 04:49
  • @venkatRamananVTR - My solution is in the edit, i.e. myDataFrame.withColumn("COL1", sha2($"COL1", 256)).withColumn("COL2", sha2($"COL2", 256)). 1. replace the column with hashed values, 2. repeat for the other columns to be hashed – Swapandeep Singh Mar 26 '21 at 00:59

0 Answers