
I have a dictionary like the one below

dictionary_Tag = {
    'A': 'unitA&',
    'B': 'B&',
    'C': 'unitC',
    'D': 'D#'}

and a table like this

|item_name|item_value|timestamp                   |idx|
+---------+----------+----------------------------+---+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |
|A        |0.3       |2023-03-01T17:30:00.000+0000|2  |
|B        |0.54      |2023-03-01T17:30:00.000+0000|2  |

I want to add one more column named "description" to the table like this

|item_name|item_value|timestamp                   |idx|description|
+---------+----------+----------------------------+---+-----------+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |B&         |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |B&         |
|A        |0.3       |2023-03-01T17:30:00.000+0000|2  |unitA&     |
|B        |0.54      |2023-03-01T17:30:00.000+0000|2  |B&         |

How can I do this using PySpark? I would appreciate any help.

MMV

2 Answers


Convert the dict to a Spark DataFrame, then join it with the table.

df  = spark.read.csv('test.csv', header=True, inferSchema=True)
df2 = spark.createDataFrame(dictionary_Tag.items(), ['item_name', 'description'])

df.join(df2, ['item_name'], 'left').show()

+---------+----------+-------------------+---+-----------+
|item_name|item_value|          timestamp|idx|description|
+---------+----------+-------------------+---+-----------+
|        B|      0.34|2023-03-01 17:20:00|  0|         B&|
|        B|      0.54|2023-03-01 17:25:00|  1|         B&|
|        B|      0.54|2023-03-01 17:30:00|  2|         B&|
|        A|      0.25|2023-03-01 17:20:00|  0|     unitA&|
|        A|       0.3|2023-03-01 17:25:00|  1|     unitA&|
|        A|       0.3|2023-03-01 17:30:00|  2|     unitA&|
+---------+----------+-------------------+---+-----------+
Lamanus

With the help of a UDF, the column can be added in one line.

dictionary_Tag = {
    'A':'unitA&',
    'B':'B&',
    'C':'unitC',
    'D':'D#'}

The code:

from pyspark.sql.functions import udf

# dict.get returns None instead of raising KeyError for item names
# that are missing from the dict
match_udf = udf(lambda item_name: dictionary_Tag.get(item_name))

df.withColumn("description", match_udf("item_name")).show()

Output:

+---------+----------+----------------------------+---+-----------+
|item_name|item_value|timestamp                   |idx|description|
+---------+----------+----------------------------+---+-----------+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |unitA&     |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |B&         |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |unitA&     |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |B&         |
|A        |0.3       |2023-03-01T17:30:00.000+0000|2  |unitA&     |
|B        |0.54      |2023-03-01T17:30:00.000+0000|2  |B&         |
+---------+----------+----------------------------+---+-----------+
arudsekaberne