how to Convert JSON String to JSON object in pyspark

Question

I have one of column type of data frame is string but actually it is containing json object of 4 schema where few fields are common. I need to convert that into jason object.

Here is schema of data frame :

query.printSchema()

root
 |-- test: string (nullable = true)

value of DF looks like

query.show(10)

+--------------------+
|                test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows

What solution i applied ::

write into text file

query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

read as json

df = spark.read.json("s3a://bucketname/temp/")

now print Schema, It is json string for each row already converted into json object

df.printSchema()

root
 |-- EventDate: string (nullable = true)
 |-- EventId: string (nullable = true)
 |-- EventNotificationType: long (nullable = true)
 |-- Interaction: struct (nullable = true)
 |    |-- ContextId: string (nullable = true)
 |    |-- Created: string (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- ModelContextId: string (nullable = true)
 |-- PurchaseActivity: struct (nullable = true)
 |    |-- BillingCity: string (nullable = true)
 |    |-- BillingCountry: string (nullable = true)
 |    |-- ShippingAndHandlingAmount: double (nullable = true)
 |    |-- ShippingDiscountAmount: double (nullable = true)
 |    |-- SubscriberId: long (nullable = true)
 |    |-- SubscriptionOriginalEndDate: string (nullable = true)
 |-- SubscriptionChurn: struct (nullable = true)
 |    |-- PaymentTypeCode: long (nullable = true)
 |    |-- PaymentTypeName: string (nullable = true)
 |    |-- PreviousPaidAmount: double (nullable = true)
 |    |-- SubscriptionRemoved: string (nullable = true)
 |    |-- SubscriptionStartDate: string (nullable = true)
 |-- TransactionDetail: struct (nullable = true)
 |    |-- Amount: double (nullable = true)
 |    |-- OrderShipToCountry: string (nullable = true)
 |    |-- PayPalUserName: string (nullable = true)
 |    |-- PaymentSubTypeCode: long (nullable = true)
 |    |-- PaymentSubTypeName: string (nullable = true)

Is there any best way where i don't need to write dataframe as text file and read it again as json file to get expected output

Have you tried the solution in [this answer](https://stackoverflow.com/a/41108013/5858851) or perhaps [this answer](https://stackoverflow.com/a/45880574/5858851)? — pault, Apr 12 '18 at 01:09

score -1 · Answer 1 · answered Dec 31 '18 at 05:00

You can use from_json() before you write into text file, but you need to define the schema first.

the code look like this :

data = query.select(from_json("test",schema=schema).alias("value")).selectExpr("value.*")

data.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

how to Convert JSON String to JSON object in pyspark

1 Answers1