
I am building a schema for the dataset below, which is read from a Hive table.

[screenshot of the source dataset]

After processing I have to write the data to S3.

I need to restructure the data and group each user's interactions by date, producing the JSON format shown in the attached image.

To build this schema, I have prepared a StructType that nests a MapType with an ArrayType of structs:

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, MapType, ArrayType)

def build_schema():
    fields = [
        StructField("expUserId", StringType(), True),
        StructField("recordDate", StringType(), True),
        StructField("siteId", StringType(), True),
        StructField("siteName", StringType(), True),
        StructField("itineraryNumber", StringType(), True),
        StructField("travelStartDate", StringType(), True),
        StructField("travelEndDate", StringType(), True),
        StructField("destinationID", StringType(), True),
        StructField("lineOfBusiness", StringType(), True),
        # map of page type -> list of (PageId, count) structs
        StructField("pageViewMap", MapType(StringType(), ArrayType(StructType([
            StructField("PageId", StringType(), True),
            StructField("count", LongType(), True)]))), True)
    ]
    schema = StructType(fields)
    return schema

Is this schema correct? How do I convert the DataFrame to that JSON structure?
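Here is a rough sketch of the aggregation I have in mind (not working code: flat_df stands for the DataFrame read from Hive, the pageType, PageId and count columns are hypothetical, and map_from_entries requires Spark 2.4+):

from pyspark.sql import functions as F

key_cols = ["expUserId", "recordDate", "siteId", "siteName",
            "itineraryNumber", "travelStartDate", "travelEndDate",
            "destinationID", "lineOfBusiness"]

nested_df = (
    flat_df
    # collect the per-page structs for each (user, date, ..., pageType) group
    .groupBy(*key_cols, "pageType")
    .agg(F.collect_list(F.struct("PageId", "count")).alias("pages"))
    # fold the pageType -> pages pairs into one map per user/date row
    .groupBy(*key_cols)
    .agg(F.map_from_entries(
        F.collect_list(F.struct("pageType", "pages"))).alias("pageViewMap"))
)

nested_df.write.json("s3a://my-bucket/output/")  # hypothetical bucket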

  • Can you please replace the screenshots with the raw text copied in instead? That would make it much easier to give you additional hints on how to work with the dataset. Thanks! – Jacek Laskowski May 10 '17 at 12:58

3 Answers


Why wouldn't you just use a SparkSession to read in the JSON and use schema to show the interpreted structure?

spark.read.json(inputPath).schema
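For example (a minimal PySpark sketch; the sample path is hypothetical):

sample_df = spark.read.json("/path/to/sample.json")  # a sample of the target JSON
sample_df.printSchema()          # human-readable tree view
inferred = sample_df.schema      # a StructType you can reuse via .schema(...)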
– Tom Lous

If your dataset is in Hive, read it using a JDBC or Hive integration layer (see Hive Tables or JDBC To Other Databases in the official documentation of Spark).

It is as simple as spark.read.format("jdbc")...load() or spark.read.table respectively (see DataFrameReader API in the official documentation).
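For instance, the Hive route might look like this (a PySpark sketch, assuming Hive support is enabled and a hypothetical table name):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()   # required for spark.read.table on Hive tables
         .getOrCreate())

df = spark.read.table("mydb.user_interactions")  # hypothetical database.table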

What's nice about this approach is that Spark can automatically infer the schema for you (so you can leave that out and have more time for yourself!)

Once the dataset is in your hands as a DataFrame or Dataset, you can save it to S3 in JSON format as follows:

inventoryDF.write.format("json").save("s3n://...")

See JSON Datasets and DataFrameWriter API in the official documentation.

I strongly recommend letting Spark do the hard work so you don't have to.

– Jacek Laskowski

You can create a new DataFrame from JSON with your own defined schema.

import org.apache.spark.sql.types._

val myManualSchema = new StructType(Array(
  new StructField("column1", StringType, true),
  new StructField("column2", LongType, false)
))

val myDf = spark.read.format("json")
                .schema(myManualSchema)
                .load("/x/y/zddd.json")

A DataFrame can also be created without specifying the schema manually; Spark will then infer the schema by evaluating the input file.

val df = spark.read.format("json").load("/x/y/zddd.json")

Read the schema from the JSON using the command below:

val SchJson = spark.read.format("json").load("/x/y/zddd.json").schema