
I have a CSV file:

1577,true,false,false,false,true

I tried to load the CSV file with a custom schema:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("id", StringType, nullable = false),
  StructField("flag1", BooleanType, nullable = false),
  StructField("flag2", BooleanType, nullable = false),
  StructField("flag3", BooleanType, nullable = false),
  StructField("flag4", BooleanType, nullable = false),
  StructField("flag6", BooleanType, nullable = false)))

val df = spark.read.schema(customSchema)
  .option("header", "false")
  .option("inferSchema", "false")
  .csv("mycsv.csv")

But the nullable property of the schema is not changing as expected:

df.printSchema
root
 |-- id: string (nullable = true)
 |-- flag1: boolean (nullable = true)
 |-- flag2: boolean (nullable = true)
 |-- flag3: boolean (nullable = true)
 |-- flag4: boolean (nullable = true)
 |-- flag6: boolean (nullable = true)
  • I think you need to cast as well: https://stackoverflow.com/questions/40526208/about-how-to-create-a-custom-org-apache-spark-sql-types-structtype-schema-object – Indrajit Swain Apr 09 '18 at 07:41
  • Also see this one: https://stackoverflow.com/questions/39917075/pyspark-structfield-false-always-returns-nullable-true-instead-of – Shaido Apr 09 '18 at 08:18
  • thanks for the help. I got a workaround from here https://stackoverflow.com/questions/47443483/how-do-i-apply-schema-with-nullable-false-to-json-reading?rq=1 – John Apr 09 '18 at 09:12

2 Answers


Please check the URLs below for details:

Spark DataFrame Schema Nullable Fields

How do I apply schema with nullable = false to json reading

Workaround

val rowDS = spark.read.textFile("mycsv.csv")
val df = spark.read.schema(customSchema).csv(rowDS)
df.printSchema()
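Another workaround (a sketch, reusing `spark` and `customSchema` from the question) is to rebuild the DataFrame from its RDD: unlike the file-based CSV reader, `spark.createDataFrame` keeps the nullability flags exactly as declared in the schema:

    // Sketch: load as before, then re-apply the schema on top of the result.
    val loaded = spark.read.schema(customSchema)
      .option("header", "false")
      .csv("mycsv.csv")

    // createDataFrame does not relax nullability, so the flags survive
    val enforced = spark.createDataFrame(loaded.rdd, customSchema)
    enforced.printSchema() // fields now report nullable = false

Note that this only changes the schema metadata; Spark will not validate that the underlying data is actually free of nulls.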

    // Create an RDD
    val rowRDD1 = spark.sparkContext.textFile("../yourfile.csv")

    // The schema is encoded in a string (six names, matching the six CSV columns)
    val schemaString = "id flag1 flag2 flag3 flag4 flag6"

    // Generate the schema based on the string of schema
    val fields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))

    val schema = StructType(fields)

    // Convert records of the RDD (rowRDD1) to Rows
    val rowRDD = rowRDD1
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1), attributes(2),
        attributes(3), attributes(4), attributes(5)))

    // Apply the schema to the RDD
    val rowDF = spark.createDataFrame(rowRDD, schema)
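To actually get nullable = false with this approach (a sketch under the same assumptions as the code above), declare the fields as non-nullable before calling createDataFrame; this code path honors the flags as written:

    // Hypothetical variant: same schema string, but with nullable = false
    val strictFields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = false))

    val strictDF = spark.createDataFrame(rowRDD, StructType(strictFields))
    strictDF.printSchema() // every field now reports nullable = false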