
I have a dataframe with a column containing a JSON string, which is converted to a dictionary using the from_json function. The problem occurs when the JSON contains an atypical string inside, like '\\"cde\\"'; the full JSON is '{"key":"abc","value":"\\"cde\\""}'.

When from_json is applied, it returns null. I think this is because it treats \\ as a single character and cannot parse the value due to the extra " characters inside.
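For what it's worth, the same literal fails outside Spark too; Python's built-in json module rejects it, which suggests the text itself is not valid JSON:

```python
import json

# The raw text actually contains \\" (two backslashes and a quote),
# so the string value ends at the escaped backslash and the parser
# chokes on the characters that follow.
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'
try:
    json.loads(raw)
except json.JSONDecodeError as e:
    print("parse failed:", e)
```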

Here is simple code snippet:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    [
        (1, '{"key":"abc","value":"\\\\"cde\\\\""}')
    ],
    ["id", "text"]
)

json_schema = StructType([
    StructField("key", StringType()),
    StructField("value", StringType()),
])

df = df.withColumn('dictext', from_json(col('text'), json_schema))

display(df)

Is there a way to clean such JSON, or maybe encode it somehow, before calling from_json? Or is there another function that can parse such a string?

gorrch

1 Answer


Is there a way to clean such JSON

For your case, I would suggest creating a UDF that captures the cleaning rules relevant to your data. For the single line of data you included, I created a sample UDF that removes the offending tokens so the JSON parses correctly:

from pyspark.sql.functions import udf

@udf("string")
def clean_json(text: str):
    # Strip the backslashes, then collapse the doubled quotes they leave behind
    return text.replace("\\", "").replace('""', '"')

# Applying the UDF
df = df.withColumn('dictext', from_json(clean_json(col('text')), json_schema))
display(df)

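The cleaning rules can be sanity-checked outside Spark with plain Python string methods and the built-in json module (a quick sketch, using the same literal as the question's DataFrame):

```python
import json

# Same literal as in the DataFrame; same two replacements as the UDF
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'
cleaned = raw.replace("\\", "").replace('""', '"')

print(json.loads(cleaned))  # → {'key': 'abc', 'value': 'cde'}
```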

Cleaning rows with regex

If you can capture all your unwanted characters with a regular expression, then you don't need the UDF - you can use your regex with the regexp_replace function directly, like this:

from pyspark.sql.functions import regexp_replace

df = df.withColumn(
    'dictext',
    from_json(
        # Strip backslashes, then collapse the doubled quotes, as in the UDF
        regexp_replace(regexp_replace('text', r'\\', ''), '""', '"'),
        json_schema
    )
)

Docs for regexp_replace
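The substitutions can be verified outside Spark with Python's re module. Note that stripping backslashes alone leaves doubled quotes behind, so a second pass (mirroring the UDF) is needed before the JSON parses:

```python
import json
import re

raw = '{"key":"abc","value":"\\\\"cde\\\\""}'

# First pass: remove backslashes; second pass: collapse doubled quotes
cleaned = re.sub('""', '"', re.sub(r'\\', '', raw))

print(json.loads(cleaned))  # → {'key': 'abc', 'value': 'cde'}
```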

Bartosz Gajda
  • With this solution I need to add a cleaning rule each time I receive another char, e.g. "\+" – gorrch Nov 03 '22 at 07:58
  • Hi gorrch, I've added an example of cleaning using a regular expression - if you can build your regex to capture all applicable cases, then you don't need any extra rules whenever something new pops up :) – Bartosz Gajda Nov 03 '22 at 17:11