0

I have some tables in which I need to mask some of its columns. Columns to be masked vary from table to table and I am reading those columns from application.conf file.

For example, for employee table as shown below

+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1  | abcd | 21  | India   |
+----+------+-----+---------+
| 2  | qazx | 42  | Germany |
+----+------+-----+---------+

if we want to mask name and age columns then I get these columns in an sequence.

val mask = Seq("name", "age")

Expected values after masking are:

+----+----------------+----------------+---------+
| id | name           | age            | address |
+----+----------------+----------------+---------+
| 1  | *** Masked *** | *** Masked *** | India   |
+----+----------------+----------------+---------+
| 2  | *** Masked *** | *** Masked *** | Germany |
+----+----------------+----------------+---------+

If I have employee table an data frame, then what is the way to mask these columns?

If I have payment table as shown below and want to mask name and salary columns then I get mask columns in Sequence as

+----+------+--------+----------+
| id | name | salary | tax_code |
+----+------+--------+----------+
| 1  | abcd | 12345  | KT10     |
+----+------+--------+----------+
| 2  | qazx | 98765  | AD12d    |
+----+------+--------+----------+
val mask = Seq("name", "salary")

I tried something like this mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) ) but it did not returned anything.


Thanks to @philantrovert, I found out the solution. Here is the solution I used:

def maskData(base: DataFrame, maskColumns: Seq[String]) = {
    val maskExpr = base.columns.map { col => if(maskColumns.contains(col)) s"'*** Masked ***' as ${col}" else col }
    base.selectExpr(maskExpr: _*)
}
zero323
  • 322,348
  • 103
  • 959
  • 935
Shekhar
  • 11,438
  • 36
  • 130
  • 186

3 Answers3

6

The simplest and fastest way would be to use withColumn and simply overwrite the values in the columns with "*** Masked ***". Using your small example dataframe

val df = spark.sparkContext.parallelize( Seq (
  (1, "abcd", 12345, "KT10" ),
  (2, "qazx", 98765, "AD12d")
)).toDF("id", "name", "salary", "tax_code")

If you have a small number of columns to be masked, with known names, then you can simply do:

val mask = Seq("name", "salary")

df.withColumn("name", lit("*** Masked ***"))
  .withColumn("salary", lit("*** Masked ***"))

Otherwise, you need to create a loop:

var df2 = df
for (col <- mask){
  df2 = df2.withColumn(col, lit("*** Masked ***"))
}

Both these approaches will give you a result like this:

+---+--------------+--------------+--------+
| id|          name|        salary|tax_code|
+---+--------------+--------------+--------+
|  1|*** Masked ***|*** Masked ***|    KT10|
|  2|*** Masked ***|*** Masked ***|   AD12d|
+---+--------------+--------------+--------+
Shaido
  • 27,497
  • 23
  • 70
  • 73
2

Please check the code below. The key is the udf function.

val df = ss.sparkContext.parallelize( Seq (
  ("c1", "JAN-2017", 49 ),
  ("c1", "MAR-2017", 83),
)).toDF("city", "month", "sales")
df.show()

val mask = udf( (s : String) => {
  "*** Masked ***"
})

df.withColumn("city", mask($"city")).show`
philantrovert
  • 9,904
  • 3
  • 37
  • 61
Robin
  • 695
  • 1
  • 7
  • 10
  • Using a udf breaks Spark's built in optimizer. Better to use the lit function with WithColumn than use a udf. Tip for others if they see this. – Zack Jul 10 '23 at 18:51
1

Your statement

mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) )

will return a List[org.apache.spark.sql.DataFrame] which doesn't sound too good.

You can use selectExpr and generate your regexp_replace expression using :

base.show
+---+----+-----+-------+
| id|name|  age|address|
+---+----+-----+-------+
|  1|abcd|12345|  KT10 |
|  2|qazx|98765|  AD12d|
+---+----+-----+-------+

val mask = Seq("name", "age")
val expr = df.columns.map { col =>
   if (mask.contains(col) ) s"""regexp_replace(${col}, "^.*", "** Masked **" ) as ${col}"""
   else col
 }

This will generate an expression with regex_replace for the columns that are present in the Sequence mask

Array[String] = Array(id, regexp_replace(name, "^.*", "** Masked **" ) as name, regexp_replace(age, "^.*", "** Masked **" ) as age, address)

Now you can use selectExpr on the generated Sequence

base.selectExpr(expr: _*).show

+---+------------+------------+-------+
| id|        name|         age|address|
+---+------------+------------+-------+
|  1|** Masked **|** Masked **|  KT10 |
|  2|** Masked **|** Masked **|  AD12d|
+---+------------+------------+-------+
philantrovert
  • 9,904
  • 3
  • 37
  • 61