
I need your help defining a dynamic schema, with fields and datatypes taken from an input metadata JSON file. Below is the JSON file:

[
  {
    "trim": true,
    "name": "id",
    "nullable": true,
    "id": null,
    "position": 0,
    "table": "employee",
    "type": "integer",
    "primaryKey": true
  },
  {
    "trim": true,
    "name": "salary",
    "nullable": true,
    "id": null,
    "position": 1,
    "table": "employee",
    "type": "double",
    "primaryKey": false
  },
  {
    "trim": true,
    "name": "dob",
    "nullable": true,
    "id": null,
    "position": 2,
    "table": "employee",
    "type": "date",
    "primaryKey": false
  }
]

I found a useful link, but there all the fields are mapped to the string data type: Programmatically generate the schema AND the data for a dataframe in Apache Spark

My requirement is a bit different. I have an input CSV file without any header, so all column values are read as the string datatype. Similarly, I have a JSON metadata file that contains each field name and its corresponding data type. I want to define a schema that maps each field name to its datatype from the JSON and applies it to the input CSV data. For example, below is the sample code I have written to map the column names from the JSON file onto the input CSV data, but it does not convert or map the columns to their corresponding datatypes:

// Read the CSV; with no header, every column is read as string
val in_emp = spark.read.csv(in_source_data)
// Read the JSON metadata (an array of field descriptions)
val in_meta_emp = spark.read.option("multiline", "true").json(in_meta_data)
// Collect the column names; note that "type" is selected but never used
val in_cols = in_meta_emp.select("name", "type").map(_.getString(0)).collect
// Rename the CSV columns; this maps the names only, not the datatypes
val in_cols_map = in_emp.toDF(in_cols: _*)

in_emp.show
in_cols_map.show
in_cols_map.dtypes

Result (mapped input datatypes):

Array[(String, String)] = Array((id,StringType), (salary,StringType), (dob,StringType))
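
One workaround I have been considering is to keep the all-string read and cast each renamed column afterwards, driving the casts from the metadata. This is only a sketch on my part: cast accepts type names like "integer" and "double", but a plain cast to date expects yyyy-MM-dd strings, so my yyyy.MM.dd values would need to_date with an explicit format instead.

import org.apache.spark.sql.functions.col

// Collect (name, type) pairs from the metadata DataFrame
val nameTypePairs = in_meta_emp
  .select("name", "type")
  .collect
  .map(r => (r.getString(0), r.getString(1)))

// Cast each column of the renamed DataFrame to its metadata type.
// Note: casting "dob" to date only works for yyyy-MM-dd input; a custom
// format like yyyy.MM.dd would need to_date(col(name), "yyyy.MM.dd").
val in_typed = nameTypePairs.foldLeft(in_cols_map) {
  case (df, (name, dtype)) => df.withColumn(name, col(name).cast(dtype))
}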

The code below shows the static way to define the schema, but I am looking for a dynamic way that picks each column and its corresponding data type from the JSON metadata file:

import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("dob", DateType, true),
  StructField("salary", DoubleType, true)
))

val in_emp = spark.read
  .schema(schema)
  // "inferSchema" is dropped here: it is ignored once an explicit schema is set
  .option("dateFormat", "yyyy.MM.dd")
  .csv(in_source_data)
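
What I am ultimately after is something like the sketch below, which builds the StructType directly from the metadata DataFrame instead of hardcoding it. The typeMap is my own assumption and would need an entry for every type name that can appear in the metadata file:

import org.apache.spark.sql.types._

// Assumed mapping from the metadata's type names to Spark datatypes
val typeMap = Map(
  "integer" -> IntegerType,
  "double"  -> DoubleType,
  "date"    -> DateType,
  "string"  -> StringType
)

// Build one StructField per metadata row, ordered by position
val fields = in_meta_emp
  .select("name", "type", "nullable", "position")
  .collect
  .sortBy(_.getAs[Long]("position"))
  .map(r => StructField(
    r.getAs[String]("name"),
    typeMap.getOrElse(r.getAs[String]("type"), StringType),
    r.getAs[Boolean]("nullable")
  ))

val dyn_schema = StructType(fields)

// Same read as above, but with the dynamically built schema
val in_emp = spark.read
  .schema(dyn_schema)
  .option("dateFormat", "yyyy.MM.dd")
  .csv(in_source_data)

With that in place, in_emp.dtypes should report integer, double, and date instead of three strings.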
  • Can you explain more about your problem statement? If you have tried some code, edit your question to include it. If my understanding is correct, you have to define the DataFrame schema using your metadata information, with the table name being dynamic as well. – Rumesh Krishnan Feb 26 '18 at 05:41
  • I have a question along the same lines: https://stackoverflow.com/questions/54745559/dynamically-generated-schema-in-json-file-for-a-csv-data-file-in-scala – Mamta Feb 18 '19 at 11:12
