
I am working on a database where the data is stored in CSV format. The DB looks like the following:

id  containertype  size
1   CASE           {height=2.01, length=1.07, width=1.22}
2   PALLET         {height=1.80, length=1.07, width=1.23}

I want to parse the data inside size column and create a pyspark df like:

id  containertype  height  length  width
1   CASE           2.01    1.07    1.22
2   PALLET         1.80    1.07    1.23

I tried parsing the string into a StructType and a MapType, but neither approach works. Is there any way to do this other than messy string manipulation?

Reproducible DataFrame code:

df = spark.createDataFrame(
    [
        ("1", "CASE", "{height=2.01, length=1.07, width=1.22}"),
        ("2", "PALLET", "{height=2.01, length=1.07, width=1.22}"),
    ],
    ["id", "containertype", "size"]
)

df.printSchema()

2 Answers


If a column contains JSON, you can parse it with the function from_json, which takes the column to parse (size, in your case) and the schema the parsing should produce. One caveat: your size string is not valid JSON, because the keys are unquoted and use = instead of :, so normalize it first:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([
    StructField("height", FloatType(), True),
    StructField("length", FloatType(), True),
    StructField("width", FloatType(), True),
])

# Quote the keys and replace "=" with ":" so the string becomes valid JSON,
# e.g. '{height=2.01, ...}' -> '{"height": 2.01, ...}'
json_str = F.regexp_replace(F.col("size"), r"(\w+)=", '"$1": ')

df.withColumn("json", F.from_json(json_str, schema)) \
  .select(F.col("id"), F.col("containertype"), F.col("json.*"))
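
As a quick sanity check against the question's sample data, you can bind the result and show it (the variable name is just for illustration, and the exact float rendering may differ slightly):

result = df.withColumn("json", F.from_json(json_str, schema)) \
    .select("id", "containertype", "json.*")
result.show()
# +---+-------------+------+------+-----+
# | id|containertype|height|length|width|
# +---+-------------+------+------+-----+
# |  1|         CASE|  2.01|  1.07| 1.22|
# |  2|       PALLET|   1.8|  1.07| 1.23|
# +---+-------------+------+------+-----+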

Use a regex to extract each value:

from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def getParameter(tag):
    # capture the number after "tag=" and cast it to float
    return F.regexp_extract("size", tag + r"=(\d+\.\d+)", 1).cast(FloatType()).alias(tag)

df.select(F.col("id"), F.col("containertype"), getParameter("height"), getParameter("length"), getParameter("width"))
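
Note that the pattern assumes every value has a fractional part (2.01, not 2). If that assumption does not hold for your data, a slightly wider pattern works; this variant is a hypothetical tweak, not part of the original answer:

def getParameter(tag):
    # also matches integer values such as "height=2"
    return F.regexp_extract("size", tag + r"=([0-9.]+)", 1).cast(FloatType()).alias(tag)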