
I have a dataframe with a column containing a JSON string, which is converted to a dictionary using the from_json function. The problem occurs when the JSON contains an atypical string inside, like '\\"cde\\"'; the full JSON is '{"key":"abc","value":"\\"cde\\""}'.

When from_json is applied, it returns null. I think this is because it treats \\ as a single character and cannot parse the value due to the extra " characters inside.
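For what it's worth, the same literal fails outside Spark too; Python's built-in json module rejects it, which suggests the text itself is not valid JSON:

```python
import json

# The raw text actually contains \\" (two backslashes and a quote),
# so the string value ends at the escaped backslash and the parser
# chokes on the characters that follow.
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'
try:
    json.loads(raw)
except json.JSONDecodeError as e:
    print("parse failed:", e)
```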

Here is simple code snippet:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    [
        (1, '{"key":"abc","value":"\\\\"cde\\\\""}')
    ],
    ["id", "text"]
)

json_schema = StructType([
    StructField("key", StringType()),
    StructField("value", StringType()),
])

df = df.withColumn('dictext', from_json(col('text'), json_schema))

display(df)

Is there a way to clean such JSON, or maybe encode it somehow, before calling from_json? Or is there another function that can parse such a string?

gorrch

1 Answer


Is there a way to clean such JSON

For your case, I would suggest creating a UDF that captures the cleaning rules relevant to your data. For the single line of data you included, I created a sample UDF that removes the offending tokens so the JSON parses correctly:

from pyspark.sql.functions import udf

@udf("string")
def clean_json(text: str):
    # Strip the backslashes, then collapse the doubled quotes they leave behind
    return text.replace("\\", "").replace('""', '"')

# Applying the UDF
df = df.withColumn('dictext', from_json(clean_json(col('text')), json_schema))
display(df)

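The cleaning rules can be sanity-checked outside Spark with plain Python string methods and the built-in json module (a quick sketch, using the same literal as the question's DataFrame):

```python
import json

# Same literal as in the DataFrame; same two replacements as the UDF
raw = '{"key":"abc","value":"\\\\"cde\\\\""}'
cleaned = raw.replace("\\", "").replace('""', '"')

print(json.loads(cleaned))  # → {'key': 'abc', 'value': 'cde'}
```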

Cleaning rows with regex

If you can capture all your unwanted characters with a regular expression, then you don't need the UDF - you can use your regex with the regexp_replace function directly, like this:

from pyspark.sql.functions import regexp_replace

df = df.withColumn(
    'dictext',
    from_json(
        # Strip backslashes, then collapse the doubled quotes, as in the UDF
        regexp_replace(regexp_replace('text', r'\\', ''), '""', '"'),
        json_schema
    )
)

Docs for regexp_replace
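The substitutions can be verified outside Spark with Python's re module. Note that stripping backslashes alone leaves doubled quotes behind, so a second pass (mirroring the UDF) is needed before the JSON parses:

```python
import json
import re

raw = '{"key":"abc","value":"\\\\"cde\\\\""}'

# First pass: remove backslashes; second pass: collapse doubled quotes
cleaned = re.sub('""', '"', re.sub(r'\\', '', raw))

print(json.loads(cleaned))  # → {'key': 'abc', 'value': 'cde'}
```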

Bartosz Gajda
  • With this solution I need to add a cleaning rule each time I receive another char, e.g. "\+" – gorrch Nov 03 '22 at 07:58
  • Hi gorrch, I've added an example of cleaning using a regular expression - if you can build your regex to capture all applicable cases, then you don't need any extra rules whenever something new pops up :) – Bartosz Gajda Nov 03 '22 at 17:11