
I'm having trouble with JSON conversion in PySpark when working with complex nested struct columns. The schema argument I pass to from_json doesn't seem to behave as expected. Example:

import pyspark.sql.functions as f

df = spark.createDataFrame([[1, 'a'], [2, 'b'], [3, 'c']], ['rownum', 'rowchar']) \
    .withColumn('struct', f.expr("transform(array(1,2,3), i -> named_struct('a1', rownum*i, 'a2', rownum*i*2))"))
df.display()

# both of these from_json calls fail:
df.withColumn('struct', f.to_json('struct')).withColumn('struct', f.from_json('struct', df.schema['struct'])).display()
df.withColumn('struct', f.to_json('struct')).withColumn('struct', f.from_json('struct', df.select('struct').schema)).display()

Both fail with:

Cannot parse the schema in JSON format: Failed to convert the JSON string (big JSON string) to a data type

I'm not sure if this is a syntax error on my end, an edge case that's failing, the wrong way to do things, or something else.
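For reference, a quick check of what I'm actually handing to from_json in each attempt (types as reported by PySpark's StructType API):

# indexing a DataFrame's schema by field name yields a StructField,
# while select(...).schema yields a StructType wrapping that field
print(type(df.schema['struct']))          # <class 'pyspark.sql.types.StructField'>
print(type(df.select('struct').schema))   # <class 'pyspark.sql.types.StructType'>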

guest1
1 Answer


You're not passing the correct schema to from_json: df.schema["struct"] is a StructField, not a DataType, so Spark can't parse its JSON representation as a schema. Use the field's .dataType instead:

df.withColumn('struct', f.to_json('struct')) \
  .withColumn('struct', f.from_json('struct', df.schema["struct"].dataType)) \
  .display()
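If you'd rather not hold on to the schema object itself, from_json also accepts a DDL-formatted string, so a variant like the following sketch should work too (it assumes the simpleString() round-trip preserves the column's type, which holds for this array-of-struct case):

# derive a DDL string like "array<struct<a1:bigint,a2:bigint>>" from the column's type
ddl = df.schema['struct'].dataType.simpleString()

df.withColumn('struct', f.to_json('struct')) \
  .withColumn('struct', f.from_json('struct', ddl)) \
  .display()

This form is handy when the schema has to travel as plain text, e.g. through a config file or a job parameter.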
blackbishop