2

I have one of column type of data frame is string but actually it is containing json object of 4 schema where few fields are common. I need to convert that into jason object.

Here is schema of data frame :

query.printSchema()

root
 |-- test: string (nullable = true)

value of DF looks like

query.show(10)

+--------------------+
|                test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows

What solution i applied ::

  1. write into text file

query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

  1. read as json

df = spark.read.json("s3a://bucketname/temp/")

  1. now print Schema, It is json string for each row already converted into json object

df.printSchema()

root
 |-- EventDate: string (nullable = true)
 |-- EventId: string (nullable = true)
 |-- EventNotificationType: long (nullable = true)
 |-- Interaction: struct (nullable = true)
 |    |-- ContextId: string (nullable = true)
 |    |-- Created: string (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- ModelContextId: string (nullable = true)
 |-- PurchaseActivity: struct (nullable = true)
 |    |-- BillingCity: string (nullable = true)
 |    |-- BillingCountry: string (nullable = true)
 |    |-- ShippingAndHandlingAmount: double (nullable = true)
 |    |-- ShippingDiscountAmount: double (nullable = true)
 |    |-- SubscriberId: long (nullable = true)
 |    |-- SubscriptionOriginalEndDate: string (nullable = true)
 |-- SubscriptionChurn: struct (nullable = true)
 |    |-- PaymentTypeCode: long (nullable = true)
 |    |-- PaymentTypeName: string (nullable = true)
 |    |-- PreviousPaidAmount: double (nullable = true)
 |    |-- SubscriptionRemoved: string (nullable = true)
 |    |-- SubscriptionStartDate: string (nullable = true)
 |-- TransactionDetail: struct (nullable = true)
 |    |-- Amount: double (nullable = true)
 |    |-- OrderShipToCountry: string (nullable = true)
 |    |-- PayPalUserName: string (nullable = true)
 |    |-- PaymentSubTypeCode: long (nullable = true)
 |    |-- PaymentSubTypeName: string (nullable = true)

Is there any best way where i don't need to write dataframe as text file and read it again as json file to get expected output

pault
  • 41,343
  • 15
  • 107
  • 149
Arvind-MSFT
  • 131
  • 2
  • 3
  • Have you tried the solution in [this answer](https://stackoverflow.com/a/41108013/5858851) or perhaps [this answer](https://stackoverflow.com/a/45880574/5858851)? – pault Apr 12 '18 at 01:09

1 Answers1

-1

You can use from_json() before you write into text file, but you need to define the schema first.

the code look like this :

data = query.select(from_json("test",schema=schema).alias("value")).selectExpr("value.*")

data.write.format("text").mode('overwrite').save("s3://bucketname/temp/")