I am building a schema for the dataset below from a hive table.
After processing I have to write the data to S3.
I need to restructure and group the user id interaction based on date attached json image format to be prepared.
For building this schema i have prepared a struct type with array.
fields = [
StructField("expUserId", StringType(), True),
StructField("recordDate", StringType(), True),
StructField("siteId", StringType(), True),
StructField("siteName", StringType(), True),
StructField("itineraryNumber", StringType(), True),
StructField("travelStartDate", StringType(), True),
StructField("travelEndDate", StringType(), True),
StructField("destinationID", StringType(), True),
StructField("lineOfBusiness", StringType(), True),
StructField("pageViewMap", MapType(StringType(),ArrayType(StructType([
StructField("PageId", StringType(), True),
StructField("count", LongType(), True)]))), True)
]
schema = StructType(fields)
return schema
Is this schema correct? How to convert the DataFrame to the below json schema type.