1

I have a dataframe as below:

+-----------------------------------------------------------------+
|ID|DATASET                                                       |
+-----------------------------------------------------------------+
|4A|[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]   |
|8B|[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]   |
+-----------------------------------------------------------------+

Expected Output:

+----------------------------------+
|ID|col_1 | col_2 | col_3| col_4   |
+----------------------------------+
|4A|"12ABC"|"141"||"ABCD"          |
|8B|"12ABC"|"141"||"ABCD"          |
+--------------------------------- +
  • i tried to use regex_extract :

df.withColumn("col_1", regexp_extract("DATASET", "(?<=col.1:)\w+(?=(,|}|]))", 0)).withColumn("col_2", regexp_extract("DATASET", "(?<=col.2:)\w+(?=(,|}|]))", 0))

but getting null values in the result

+----------------------------------+
|ID|col_1 | col_2 | col_3| col_4   |
+----------------------------------+
|4A||||                            |
|8B||||                            |
+--------------------------------- +

Any input on this

Thanks in advance

EDITED

Thanks for the response, that works fine, my input getting changed a little bit, i want to group based on col1 and place values in separate rows

Updated Dataset:

+---------------------------------------------------------------------------------------------------------------------------+
|ID|DATASET                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------+
|4A|[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]   |
+---------------------------------------------------------------------------------------------------------------------------+

Expected Result:

+-----------------------------------------------------------------+
|ID|col_1 | col_2 | col                                           |
+-----------------------------------------------------------------+
|4A|"12ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"           |
|4A|"13ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"           |
+-----------------------------------------------------------------+

Thanks in advance

Prabhanj
  • 262
  • 2
  • 3
  • 16

1 Answers1

0

try using structfield

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType(
[
    StructField('`col.1`', StringType(), True),
    StructField('`col.2`', StringType(), True),
    StructField('`col.3`', StringType(), True),
    StructField('`col.4`', StringType(), True),
]
)

df.withColumn("DATASET", from_json("DATASET", schema))\
.select(col('ID'), col('data.*'))\
.show()
Jay Kakadiya
  • 501
  • 1
  • 5
  • 12