Need your help in defining a dynamic schema with fields and datatypes from input metadata JSon file .Below is the JSon file
[
{
"trim": true,
"name": "id",
"nullable": true,
"id": null,
"position": 0,
"table": "employee",
"type": "integer",
"primaryKey": true
},
{
"trim": true,
"name": "salary",
"nullable": true,
"id": null,
"position": 1,
"table": "employee",
"type": "double",
"primaryKey": false
},
{
"trim": true,
"name": "dob",
"nullable": true,
"id": null,
"position": 2,
"table": "employee",
"type": "date",
"primaryKey": false
}
]
Found a useful link but all the fields are mapped to string data type . Programmatically generate the schema AND the data for a dataframe in Apache Spark
My requirement is bit different .I have an input CSV file without any header contents and therefore all column values are of string datatype.Similarly I have a JSON metadata file that contains the field name and corresponding data types.I want to define a schema that maps the field name with corresponding datatype from JSON to input CSV data. For example :Below is the sample code I have written for mapping the column names from JSON file to input CSV data . But it doesn't convert or map the columns to the corresponding datatype .
val in_emp = spark.read.csv(in_source_data)
val in_meta_emp = spark.read.option("multiline","true").json(in_meta_data)
val in_cols = in_meta_emp.select("name","type").map(_.getString(0)).collect
val in_cols_map = in_emp.toDF(in_cols:_*)
in_emp.show
in_cols_map.show
in_cols_map.dtypes
Result :
Please click to see the result
3 :mapped Input datatypes
Array[(String, String)] = Array((id,StringType),
(salary,StringType), (dob,StringType))
Below code depicts the static way to define the schema but I am looking for dynamic way that picks the column and corresponding data type from JSON metadata file.
val schema = StructType (Array(
StructField("id",IntegerType,true ),
StructField("dob",DateType,true ),
StructField("salary",DoubleType,true )
))
val in_emp =
spark.read
.schema(schema)
.option("inferSchema","true")
.option("dateFormat", "yyyy.MM.dd")
.csv(in_source_data)