
I have a file which I am converting into a DataFrame. For the schema, I want it to be read from a config file.

I don't want to hardcode the schema in the code, as it might change over time, so we are putting the schema in a separate file.

val searchPath = "/hdfs/cbt/dfgdfgdf_fsdfg/data/noheaderfile"
val columns = "Name,ID,Address,City"

val fields = columns.split(",").map(fieldName => StructField(fieldName, StringType, 
nullable = true))
val customSchema = StructType(fields)
var dfPivot = spark.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(customSchema)
  .load(searchPath)

Here I want the following line of code to be changed:

val columns = "Name,ID,Address,City"

Instead, the schema should be read from a file.

Please advise.

pdpi
Tisha

1 Answer


You can find a solution here: How to create a Schema file in Spark

But you need to include the type of each column in your file:

import org.apache.spark.sql.types._
val columns = "Name String,ID String,Address String,City String"
val schema = StructType(
  columns
    .split(",")
    .map(_.trim.split(" "))
    .map(x => StructField(x(0), getType(x(1)), nullable = true))
)

The `getType` helper is:

def getType(raw: String): DataType = {
  raw match {
    case "ByteType" => ByteType
    case "ShortType" => ShortType
    case "IntegerType" => IntegerType
    case "LongType" => LongType
    case "FloatType" => FloatType
    case "DoubleType" => DoubleType
    case "BooleanType" => BooleanType
    case "TimestampType" => TimestampType
    case _ => StringType
  }
}
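Putting the pieces together for the original question: the column spec can live in a plain text file and be read with `scala.io.Source` before being mapped through `getType`. A minimal sketch (the config file path and its one-line contents are assumptions, not something from the question):

```scala
import scala.io.Source
import org.apache.spark.sql.types._

// Hypothetical config file /path/to/schema.conf containing one line:
// Name String,ID String,Address String,City String
val raw = Source.fromFile("/path/to/schema.conf").getLines().mkString(",")

val schema = StructType(
  raw.split(",")
     .map(_.trim.split("\\s+"))                              // "Name String" -> Array("Name", "String")
     .map(x => StructField(x(0), getType(x(1)), nullable = true))
)

// Use the schema exactly as in the question's code
val df = spark.read
  .option("header", "false")
  .schema(schema)
  .csv("/hdfs/cbt/dfgdfgdf_fsdfg/data/noheaderfile")
```

Note that `"String"` is not one of the labels `getType` matches on, so it falls through to the default `StringType` case, which is what the example column spec relies on.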
  • I got the following error when I tried the above piece of code ":32: error: object java.lang.String is not a value case "StringType" => String ^ " – Tisha May 22 '19 at 07:54
  • I changed two things, now it should work. First, I added `import org.apache.spark.sql.types._`. Second, the other solution worked with `column: List[String]`; this one works with `column: String`. Can you confirm that it now works for you? – Pablo López Gallego May 22 '19 at 08:10
  • val schema = Source.fromFile("schema.txt").getLines().toList.flatMap(_.split(",")).map(_.replaceAll("\"", "").split(" ")).map(x => StructField(x(0), getType(x(1)), true)) — For this piece of code, what is the "Source" mentioned after val schema? I have taken this from the link "How to create a Schema file in Spark" as mentioned in the comment. Because I am unable to use Source.fromFile as I am using HDFS – Tisha May 22 '19 at 11:36
  • I want the columns to be passed through a configuration file. You have mentioned columns as val columns = "Name String,ID String,Address String,City String" I want a path to be mentioned at its place which has the schema for the dataframe – Tisha May 22 '19 at 11:53
  • `val schema = Source.fromFile("schema.txt")` means your source, for example: `val schema = spark.read.format("your-file-format-like-csv-or-text").load("your-path/"+"your-file")` – Pablo López Gallego May 22 '19 at 12:53
  • I am getting this error :32: error: not found: value Source ............while I give this code val schema = Source.fromFile("/hdfs/cbt/dfgdfgdf_fsdfg/data/configfile").getLines().toList.flatMap(_.split(",")).map(_.replaceAll("\"", "").split(" ")).map(x => StructField(x(0), getType(x(1)), true)) – Tisha May 22 '19 at 13:26
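As the last comments show, `Source.fromFile` fails twice over here: it needs `import scala.io.Source` (hence `not found: value Source`), and even with the import it only reads the local filesystem, not HDFS. One way around both issues (a sketch, not tested against the asker's cluster) is to let Spark itself read the schema file from HDFS:

```scala
import org.apache.spark.sql.types._

// Read the one-line schema file from HDFS with Spark rather than scala.io.Source;
// collect() is safe here because the config file is tiny
val raw = spark.read.textFile("/hdfs/cbt/dfgdfgdf_fsdfg/data/configfile")
  .collect()
  .mkString(",")

val schema = StructType(
  raw.split(",")
     .map(_.trim.split("\\s+"))
     .map(x => StructField(x(0), getType(x(1)), nullable = true))
)
```

`spark.read.textFile` returns a `Dataset[String]` with one element per line, so this also handles a schema file that spreads the column spec across several lines.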