
I have a sensitive column in my data frame. I need to replace the sensitive values with numbers, but in a way that keeps the distinct counts of the column accurate. I was thinking of a SQL function over a window partition, but couldn't find a way to do it.

A sample dataframe is below.

    df = (sc.parallelize([
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"}
    ]).toDF()
    .cache())


I would like to create a dataframe like below.

What is a way to get this done?


2 Answers


You are looking for the dense_rank function:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
  "non_sensitive_id",
  F.dense_rank().over(Window.partitionBy().orderBy("sensitive_id"))
).show()

+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
|        1234|               1|
|        1234|               1|
|        1234|               1|
|        2345|               2|
|        2345|               2|
|        6789|               3|
|        6789|               3|
|        6789|               3|
|        6789|               3|
+------------+----------------+
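
Since the question mentions a SQL function over a window partition, the same result can also be written as a Spark SQL query. This is a sketch, assuming a SparkSession named spark is available; the temporary view name df_view is arbitrary:

    # Register the DataFrame so it can be queried with SQL.
    df.createOrReplaceTempView("df_view")

    # DENSE_RANK over the whole frame gives the same consecutive numbering
    # as the DataFrame-API version above.
    spark.sql("""
        SELECT sensitive_id,
               DENSE_RANK() OVER (ORDER BY sensitive_id) AS non_sensitive_id
        FROM df_view
    """).show()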

This is another way of doing it, though it may not be very efficient because the join() will involve a shuffle -

Creating the DataFrame -

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
df = sqlContext.createDataFrame([(1234,),(1234,),(1234,),(2345,),(2345,),(6789,),(6789,),(6789,),(6789,)],['sensitive_id']) 

Creating a DataFrame of the distinct elements, labeling them 1, 2, 3..., and finally joining the two DataFrames -

df_distinct = df.select('sensitive_id').distinct().withColumn('non_sensitive_id', row_number().over(Window.orderBy('sensitive_id')))
df = df.join(df_distinct, ['sensitive_id'],how='left').orderBy('sensitive_id')
df.show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
|        1234|               1|
|        1234|               1|
|        1234|               1|
|        2345|               2|
|        2345|               2|
|        6789|               3|
|        6789|               3|
|        6789|               3|
|        6789|               3|
+------------+----------------+
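
As a side note, if the set of distinct sensitive_id values is small, the shuffle on the large side of the join can be avoided by broadcasting the lookup table. A sketch reusing the DataFrames defined above:

    from pyspark.sql.functions import broadcast

    # Broadcasting df_distinct ships the small lookup table to every executor,
    # so the large df does not need to be shuffled for the join.
    df = df.join(broadcast(df_distinct), ['sensitive_id'], how='left').orderBy('sensitive_id')
    df.show()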