
I am trying to read a schema stored in a text file on HDFS and use it while creating a DataFrame.

schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",
            .... and so on

jsonDF = spark.read.schema(schema).json('/path/test.json')

Since the schema is too big to define inside the code, can anyone please suggest the best way to do this?

I tried the ways below, but they don't work.

schema = sc.wholeTextFiles("hdfs://path/sample.schema")
schema = spark.read.text('/path/sample.schema')
jakrm

2 Answers


I figured out how to do this.

1. Define the schema of the JSON file

json_schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", TimestampType(), True),
    StructField("col4",
        StructType([
            StructField("col5", StringType(), True),
            StructField("col6",

2. Print the JSON representation of the schema

print(json_schema.json())
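For the (truncated) schema above, the output is a single JSON string that should look roughly like this (abridged):

{"type": "struct", "fields": [{"name": "col1", "type": "string", "nullable": true, "metadata": {}}, {"name": "col2", "type": "string", "nullable": true, "metadata": {}}, {"name": "col3", "type": "timestamp", "nullable": true, "metadata": {}}, ...]}

Since step 4 reads only the first line of the file, keep the whole JSON string on a single line.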

3. Copy and paste the above output into a file named sample.schema

4. In the code, recreate the schema as below

import json

schema_file = 'path/sample.schema'
schema_json = spark.read.text(schema_file).first()[0]
schema = StructType.fromJson(json.loads(schema_json))

5. Create a DataFrame using the above schema

jsonDF = spark.read.schema(schema).json('/path/test.json')

6. Insert the data from the DataFrame into a Hive table

jsonDF.write.mode("append").insertInto("hivetable")

I referred to this article: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/

jakrm

I haven't tested it with HDFS, but I assume it is similar to reading from a local file. The idea is to store the file as a dict and then parse it to create the desired schema. I took inspiration from here. Currently it lacks support for nullable, and I have not tested it with deeper levels of nested structs.

from pyspark.sql import SparkSession
from pyspark.sql.types import *
import json

spark = SparkSession.builder.appName('myPython').getOrCreate()

# read the schema file (a JSON dict mapping column names to type names)
f = open("/path/schema_file", "r")
dictString = f.read()

derived_schema = StructType([])

# parse the file contents into a Python dict
jdata = json.loads(dictString)


# map a type name from the schema file to the corresponding Spark type
def get_type(v):
    if v == "StringType":
        return StringType()
    if v == "TimestampType":
        return TimestampType()
    if v == "IntegerType":
        return IntegerType()


# recursively build a StructType: string values map to simple types,
# nested dicts become nested StructTypes
def generate_schema(jdata, derived_schema):
    for k, v in sorted(jdata.items()):
        if (isinstance(v, str)):
            derived_schema.add(StructField(k, get_type(v), True))
        else:
            added_schema = StructType([])
            added_schema = generate_schema(v, added_schema)
            derived_schema.add(StructField(k, added_schema, True))
    return derived_schema


generate_schema(jdata, derived_schema)

from datetime import datetime

# sample row matching the derived schema (the list fills the nested struct col4)
data = [("first", "the", datetime.utcnow(), ["as", 1])]

input_df = spark.createDataFrame(data, derived_schema)

input_df.printSchema()

With the file being:

{
  "col1" : "StringType",
  "col2" : "StringType",
  "col3" : "TimestampType",
  "col4" : {
    "col5" : "StringType",
    "col6" : "IntegerType"
  }
}
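
With that schema file and the sample row above, input_df.printSchema() should print something like this (all fields are nullable because generate_schema hardcodes True):

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: timestamp (nullable = true)
 |-- col4: struct (nullable = true)
 |    |-- col5: string (nullable = true)
 |    |-- col6: integer (nullable = true)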
LizardKing
  • I already have my schema defined in schema (above) as StructType and just want to use it while creating the DF. I tried, f = open("/path/sample.schema", "r") schema = f.read() as per your code. But when I try to use it like this jsonDF = spark.read.schema(schema).json('/path/test.json') I get an error 'TypeError: schema should be StructType'. – jakrm Jul 25 '19 at 10:10
  • Above code works for struct data types but not arrays. How can this be modified to handle arrays? – jakrm Jul 25 '19 at 11:50
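
Regarding the array question: one possible (untested) extension is to represent an array in the schema file as a one-element list, for example a hypothetical "col7": ["StringType"], and map it to ArrayType in generate_schema:

def generate_schema(jdata, derived_schema):
    for k, v in sorted(jdata.items()):
        if isinstance(v, str):
            derived_schema.add(StructField(k, get_type(v), True))
        elif isinstance(v, list):
            # one-element list -> array of that element type (assumed convention)
            derived_schema.add(StructField(k, ArrayType(get_type(v[0])), True))
        else:
            # nested dict -> nested StructType
            added_schema = StructType([])
            added_schema = generate_schema(v, added_schema)
            derived_schema.add(StructField(k, added_schema, True))
    return derived_schema

ArrayType is already available via the pyspark.sql.types wildcard import above.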