
I am working on a database where the data is stored in CSV format. The DB looks like the following:

id  containertype  size
1   CASE           {height=2.01, length=1.07, width=1.22}
2   PALLET         {height=1.80, length=1.07, width=1.23}

I want to parse the data inside size column and create a pyspark df like:

id  containertype  height  length  width
1   CASE           2.01    1.07    1.22
2   PALLET         1.80    1.07    1.23

I tried parsing the string into a StructType and a MapType, but neither approach works. Is there any way to do this other than messy string manipulation?

Reproducible DataFrame code:

df = spark.createDataFrame(
    [
        ("1", "CASE", "{height=2.01, length=1.07, width=1.22}"),
        ("2", "PALLET", "{height=2.01, length=1.07, width=1.22}"),
    ],
    ["id", "containertype", "size"]
)

df.printSchema()

2 Answers


If a column contains JSON, you can parse it with the function from_json, which takes the column to parse (size, in your case) and the schema the parsing should produce. One caveat: your size string is not valid JSON, because the keys are unquoted and use = instead of :, so normalize it first:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([
    StructField("height", FloatType(), True),
    StructField("length", FloatType(), True),
    StructField("width", FloatType(), True),
])

# Quote the keys and replace "=" with ":" so the string becomes valid JSON,
# e.g. '{height=2.01, ...}' -> '{"height": 2.01, ...}'
json_str = F.regexp_replace(F.col("size"), r"(\w+)=", '"$1": ')

df.withColumn("json", F.from_json(json_str, schema)) \
  .select(F.col("id"), F.col("containertype"), F.col("json.*"))
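
As a quick sanity check against the question's sample data, you can bind the result and show it (the variable name is just for illustration, and the exact float rendering may differ slightly):

result = df.withColumn("json", F.from_json(json_str, schema)) \
    .select("id", "containertype", "json.*")
result.show()
# +---+-------------+------+------+-----+
# | id|containertype|height|length|width|
# +---+-------------+------+------+-----+
# |  1|         CASE|  2.01|  1.07| 1.22|
# |  2|       PALLET|   1.8|  1.07| 1.23|
# +---+-------------+------+------+-----+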

Use a regex to extract each value:

from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getParameter(tag):
    # capture the number after "tag=" and cast it to float
    return F.regexp_extract("size", tag + r"=(\d+\.\d+)", 1).cast(FloatType()).alias(tag)

df.select(F.col("id"), F.col("containertype"), getParameter("height"), getParameter("length"), getParameter("width"))
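
Note that the pattern assumes every value has a fractional part (2.01, not 2). If that assumption does not hold for your data, a slightly wider pattern works; this variant is a hypothetical tweak, not part of the original answer:

def getParameter(tag):
    # also matches integer values such as "height=2"
    return F.regexp_extract("size", tag + r"=([0-9.]+)", 1).cast(FloatType()).alias(tag)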