
I'm writing a PySpark job to read the Values column from table1. Table1 has two columns: ID and Values.

Sample data in the Values column:

+----+-----------------------------------+
| ID |               values              |
+----+-----------------------------------+
| 1  | a=10&b=2&c=13&e=55&d=78&j=98&l=99 |
| 2  | l=22&e=67&j=34&a=7&c=9&d=77&b=66  |
+----+-----------------------------------+

I have to read the values column from a Delta table and split it. Then I have to store the result in another Delta table, as depicted below:

+----+----+----+----+----+----+----+----+
| ID | a  | b  | c  | d  | e  | j  | l  |
+----+----+----+----+----+----+----+----+
| 1  | 10 | 2  | 13 | 78 | 55 | 98 | 99 |
| 2  | 7  | 66 | 9  | 77 | 67 | 34 | 22 |
+----+----+----+----+----+----+----+----+

Any suggestions on how to resolve this would be helpful.

boring-coder
  • I have tried asking chatgpt to get you started. First, we can use the split function to split the string on "&". Then we can explode to make a new row for each key=value pair in the string. Then we split again, this time on "=". Now the first part of that split is used to pivot the table on, so that your keys become the column names. I'd say try asking chatgpt for the code. – Tessa I Feb 08 '23 at 14:08
  • Thank you but I'm on the waiting list. ChatGPT is not accepting any more signups. I think I have to go with custom logic to achieve it. – boring-coder Feb 08 '23 at 14:17
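
A minimal sketch of the approach described in the comment above (split on "&", explode, split again on "=", then pivot), assuming the source DataFrame is named df and has the ID and values columns:

from pyspark.sql import functions as F

pairs = (
    df
    # one row per "key=value" pair
    .withColumn("pair", F.explode(F.split(F.col("values"), "&")))
    # split each pair into its key and its value
    .withColumn("key", F.split(F.col("pair"), "=").getItem(0))
    .withColumn("value", F.split(F.col("pair"), "=").getItem(1))
)

# pivot so each key becomes its own column
pairs.groupBy("ID").pivot("key").agg(F.first("value")).show()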

2 Answers

You can do the following:

from pyspark.sql import functions as F

# df is the DataFrame read from table1, with columns ID and values
(
    df
    # one row per "key=value" pair
    .withColumn("values", F.explode(F.split(F.col("values"), "&", limit=0)))
    # extract the key (the letters before "=") into a "tag" column
    .withColumn("tag", F.regexp_extract(F.col("values"), "^[a-z]+", 0))
    # strip the "key=" prefix so only the value remains
    .withColumn("values", F.regexp_replace(F.col("values"), "^[a-z]+[=]", ""))
    # pivot the tags into columns, one row per ID
    .groupby("ID")
    .pivot("tag")
    .agg(F.first(F.col("values")))
    .show()
)

Output:

+---+---+---+---+---+---+---+---+
| ID|  a|  b|  c|  d|  e|  j|  l|
+---+---+---+---+---+---+---+---+
|  1| 10|  2| 13| 78| 55| 98| 99|
|  2|  7| 66|  9| 77| 67| 34| 22|
+---+---+---+---+---+---+---+---+
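
To store the pivoted result in another Delta table (as the question asks) rather than just showing it, assign the chained expression above to a variable instead of ending with .show(), then write it out. A minimal sketch, with "table2" as a placeholder target name:

# result is the pivoted DataFrame built above (drop the trailing .show())
result.write.format("delta").mode("overwrite").saveAsTable("table2")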
langtang

You can convert the values column to a map type by using the transform function after splitting, then select all of the keys from the map:

from pyspark.sql.functions import (expr, aggregate, create_map, map_concat,
                                   explode_outer, collect_set, col)

df = spark.createDataFrame([(1, "a=10&b=2&c=13&e=55&d=78&j=98&l=99"),
                            (2, "l=22&e=67&j=34&a=7&c=9&d=77&b=66")],
                           ["ID", "values"])

transformed_df = \
    df.withColumn("values",
                  # split on "&", then turn each "key=value" pair into a single-entry map
                  expr("transform(split(values, '&'), c -> map(split(c, '=')[0], cast(split(c, '=')[1] as int)))")) \
      .withColumn("values",
                  # merge the array of single-entry maps into one map per row
                  aggregate("values", create_map().cast("map<string,int>"),
                            lambda acc, m: map_concat(acc, m)))

# if the keys are fixed
keys = ['a', 'b', 'c', 'd', 'e', 'j', 'l']
# else, or to avoid hardcoded values, collect the distinct keys from the data
keys = sorted(transformed_df.select(explode_outer('values'))
              .select(collect_set("key").alias("key")).first().asDict().get("key"))

transformed_df.select("ID", *[col("values").getItem(k).alias(k) for k in keys]).show()

+---+---+---+---+---+---+---+---+
| ID|  a|  b|  c|  d|  e|  j|  l|
+---+---+---+---+---+---+---+---+
|  1| 10|  2| 13| 78| 55| 98| 99|
|  2|  7| 66|  9| 77| 67| 34| 22|
+---+---+---+---+---+---+---+---+
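
In the question's setup, df would come from the source Delta table rather than from createDataFrame. A minimal sketch, assuming table1 is registered in the metastore (for a path-based table, use spark.read.format("delta").load(<path>) instead):

# read the source Delta table; "table1" matches the name used in the question
df = spark.read.table("table1").select("ID", "values")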
Mohana B C