PySpark create new column with mapping from a dict

Question

Using Spark 1.6, I have a Spark DataFrame column (named let's say col1) with values A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) with the values from the dict here below. How do I map this? (e,g. 'A' needs to be mapped to 'S' etc.)

dict = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

score 120 · Accepted Answer · edited Aug 05 '22 at 09:36

Inefficient solution with UDF (version independent):

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def translate(mapping):
    def translate_(col):
        return mapping.get(col)
    return udf(translate_, StringType())

df = sc.parallelize([('DS', ), ('G', ), ('INVALID', )]).toDF(['key'])
mapping = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

df.withColumn("value", translate(mapping)("key"))

with the result:

+-------+-----+
|    key|value|
+-------+-----+
|     DS|    S|
|      G|   NS|
|INVALID| null|
+-------+-----+

Much more efficient (Spark >= 2.0, Spark < 3.0) is to create a MapType literal:

from pyspark.sql.functions import col, create_map, lit
from itertools import chain

mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df.withColumn("value", mapping_expr.getItem(col("key")))

with the same result:

+-------+-----+
|    key|value|
+-------+-----+
|     DS|    S|
|      G|   NS|
|INVALID| null|
+-------+-----+

but more efficient execution plan:

== Physical Plan ==
*Project [key#15, keys: [B,DNS,DS,F,E,H,C,G,A], values: [S,S,S,NS,NS,NS,S,NS,S][key#15] AS value#53]
+- Scan ExistingRDD[key#15]

compared to UDF version:

== Physical Plan ==
*Project [key#15, pythonUDF0#61 AS value#57]
+- BatchEvalPython [translate_(key#15)], [key#15, pythonUDF0#61]
   +- Scan ExistingRDD[key#15]

In Spark >= 3.0 getItem should be replaced with __getitem__ ([]), i.e:

from pyspark.sql.functions import col, create_map, lit
from itertools import chain

mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df.withColumn("value", mapping_expr[col("key")])

zero323 - Having trouble understanding this method. ```withColumn``` takes a ```colName``` string as first arg, but you're passing a map value instead of key? ```df.withColumn("value", mapping_expr.getItem(col("key")))``` Also result from ```create_map``` looks like this and calling getItem() like above is not working for me: ```Column``` Any thoughts? — Andrew Hummel, Apr 01 '20 at 17:14
Thank you so much for this solution ! It works like a charm. Also if you wish to do the same behaviour with a list instead of the fict : `from pyspark.sql.functions import array` For example: `values = [1, 2, 3, 4, 5]` `indexing_expr = array(*[lit(x) for x in values]) # The * is important` `df.withColumn("value", indexing_expr[col("index")])` — LePuppy, Nov 12 '20 at 15:28

score 41 · Answer 2 · answered Mar 06 '19 at 15:11

41

Sounds like the simplest solution would be to use the replace function: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace

mapping= {
        'A': '1',
        'B': '2'
    }
df2 = df.replace(to_replace=mapping, subset=['yourColName'])

answered Mar 06 '19 at 15:11

Haim Bendanan

741
7
13

4

The problem here is that this will not create a new column, it will replace the values in the original one. – Shaido Jun 12 '19 at 09:16
6

Couldn't you just first copy the old values into a new column first, then use this function? – SummerEla Jan 30 '20 at 22:35
8

Replace also requires the new values to be the same type as the original column. – Kieran Cooney Jun 16 '20 at 13:11
5

I like the shortness of this solution. If you want an extra column just copy the column using `.withColumn("newColumn", "column_to_copy")` or so - The example just provides the minimum code you need to know to do this yourselfs :) Sometimes I think comments on SO are just used to be pedantic.. – Markus Aug 04 '20 at 15:51
1

if someone could benchmark this solution against the accepted one, that would be very helpful. I implemented the `replace` but I does slow down my code considerably which puzzles me. – safex Oct 12 '21 at 06:45

ZygD · Answer 3 · 2022-08-05T13:23:15.840

Without itertools import, list comprehensions deal with it very nicely.

Map from dict:

F.create_map([F.lit(x) for i in dic.items() for x in i])

Extracting values:

F.create_map([F.lit(x) for i in dic.items() for x in i])[F.col('col1')]

Full test:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('A',), ('E',), ('INVALID',)],
    ['col1']
)
dic = {'A': 'S', 'B': 'S', 'E': 'NS'}

map_col = F.create_map([F.lit(x) for i in dic.items() for x in i])
df = df.withColumn('col2', map_col[F.col('col1')])

df.show()
# +-------+----+
# |   col1|col2|
# +-------+----+
# |      A|   S|
# |      E|  NS|
# |INVALID|null|
# +-------+----+

score 1 · Answer 4 · answered Mar 22 '21 at 14:27

If you want to create a map col from a nested dictionary you can use this:

def create_map(d,):
    if type(d) != dict:
        return F.lit(d)

    level_map = []
    for k in d:
        level_map.append(F.lit(k))
        level_map.append(create_map(d[k]))
    return F.create_map(level_map)

d = {'a': 1, 'b': {'c': 2, 'd': 'blah'}}
print(create_map(d)) # <- Column<b'map(a, 1, b, map(c, 2, d, blah))'>

score 0 · Answer 5 · answered Mar 17 '22 at 12:43

0

you can use the function which convert dictionary into case syntax in Spark SQL

func_mapper = lambda dic,col,default : f"(CASE {col} WHEN " + " WHEN ".join([ f"'{k}' THEN '{v}'" for (k,v) in dic.items() ]) + f" ELSE '{default}' END)"

answered Mar 17 '22 at 12:43

Ashwini Jindal

811
8
15

score 0 · Answer 6 · answered Jan 07 '23 at 10:05

In case anyone needs to map null values as well, the accepted answer didn't work for me. The problem with map type is it can't handle null-valued keys.

But we can replace it with a generated CASE WHEN statement and use isNull instead of == None:

from pyspark.sql import functions as F
from functools import reduce

d = spark.sparkContext.parallelize([('A', ), ('B', ), (None, ), ('INVALID', )]).toDF(['key'])

mapping = {'A': '1', 'B': '2', None: 'empty'}
map_tuples = list(mapping.items())

def email_eq_null_safe(key):
    if key is None:
        return F.col('key').isNull()
    else:
        return F.col('key') == key

'''
F.when(
    F.col('key') == key1,
    value1
).when(
    F.col('key') == key2,
    value2
)....
'''
whens = reduce(
    lambda prev, nxt: prev.when(email_eq_null_safe(nxt[0]), nxt[1]), 
    map_tuples[1:],
    F.when(email_eq_null_safe(map_tuples[0][0]), map_tuples[0][1])
)

d.select(
    'key',
    whens.alias('value')
).show()

+-------+-----+
|    key|value|
+-------+-----+
|      A|    1|
|      B|    2|
|   null|empty|
|INVALID| null|
+-------+-----+

PySpark create new column with mapping from a dict

6 Answers6

Linked

Related