
I am building a schema for the dataset below, which is read from a Hive table.

[screenshot of the source dataset]

After processing I have to write the data to S3.

I need to restructure the data and group each user's interactions by date, producing the JSON format shown in the attached image.

To build this schema, I have prepared a StructType that nests a MapType with an ArrayType of structs:

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, MapType, ArrayType)

def build_schema():
    fields = [
        StructField("expUserId", StringType(), True),
        StructField("recordDate", StringType(), True),
        StructField("siteId", StringType(), True),
        StructField("siteName", StringType(), True),
        StructField("itineraryNumber", StringType(), True),
        StructField("travelStartDate", StringType(), True),
        StructField("travelEndDate", StringType(), True),
        StructField("destinationID", StringType(), True),
        StructField("lineOfBusiness", StringType(), True),
        # map of page type -> list of (PageId, count) structs
        StructField("pageViewMap", MapType(StringType(), ArrayType(StructType([
            StructField("PageId", StringType(), True),
            StructField("count", LongType(), True)]))), True)
    ]
    schema = StructType(fields)
    return schema

Is this schema correct? How do I convert the DataFrame to that JSON structure?
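Here is a rough sketch of the aggregation I have in mind (not working code: flat_df stands for the DataFrame read from Hive, the pageType, PageId and count columns are hypothetical, and map_from_entries requires Spark 2.4+):

from pyspark.sql import functions as F

key_cols = ["expUserId", "recordDate", "siteId", "siteName",
            "itineraryNumber", "travelStartDate", "travelEndDate",
            "destinationID", "lineOfBusiness"]

nested_df = (
    flat_df
    # collect the per-page structs for each (user, date, ..., pageType) group
    .groupBy(*key_cols, "pageType")
    .agg(F.collect_list(F.struct("PageId", "count")).alias("pages"))
    # fold the pageType -> pages pairs into one map per user/date row
    .groupBy(*key_cols)
    .agg(F.map_from_entries(
        F.collect_list(F.struct("pageType", "pages"))).alias("pageViewMap"))
)

nested_df.write.json("s3a://my-bucket/output/")  # hypothetical bucket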

  • Can you please replace the screenshots with the raw text copied in instead? That would make it much easier to give you additional hints on how to work with the dataset. Thanks! – Jacek Laskowski May 10 '17 at 12:58

3 Answers


Why wouldn't you just use a SparkSession to read in the JSON and use schema to show the interpreted structure?

spark.read.json(inputPath).schema
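For example (a minimal PySpark sketch; the sample path is hypothetical):

sample_df = spark.read.json("/path/to/sample.json")  # a sample of the target JSON
sample_df.printSchema()          # human-readable tree view
inferred = sample_df.schema      # a StructType you can reuse via .schema(...)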
– Tom Lous

If your dataset is in Hive, read it using a JDBC or Hive integration layer (see Hive Tables or JDBC To Other Databases in the official documentation of Spark).

It is as simple as spark.read.format("jdbc")...load() or spark.read.table respectively (see DataFrameReader API in the official documentation).
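For instance, the Hive route might look like this (a PySpark sketch, assuming Hive support is enabled and a hypothetical table name):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()   # required for spark.read.table on Hive tables
         .getOrCreate())

df = spark.read.table("mydb.user_interactions")  # hypothetical database.table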

What's nice about this approach is that Spark can automatically infer the schema for you (so you can leave that out and have more time for yourself!)

Once the dataset is in your hands as a DataFrame or Dataset, you can save it to S3 in JSON format as follows:

inventoryDF.write.format("json").save("s3n://...")

See JSON Datasets and DataFrameWriter API in the official documentation.

I strongly recommend letting Spark do the hard work so you don't have to.

– Jacek Laskowski

You can create a new DataFrame from JSON with your own defined schema.

import org.apache.spark.sql.types._

val myManualSchema = new StructType(Array(
  new StructField("column1", StringType, true),
  new StructField("column2", LongType, false)
))

val myDf = spark.read.format("json")
                .schema(myManualSchema)
                .load("/x/y/zddd.json")

A DataFrame can also be created without specifying the schema manually; Spark will then infer the schema by evaluating the input file.

val df = spark.read.format("json").load("/x/y/zddd.json")

Read the schema from the JSON using the command below:

val SchJson = spark.read.format("json").load("/x/y/zddd.json").schema