
I have a file like this:

1,ITEM_001,CAT_01,true,2,50,4,0,false,2019-01-01,2019-01-28,true
1,ITEM_001,CAT_01,true,2,60,4,0,false,2019-01-29,2019-12-32,true
1,ITEM_002,CAT_02,true,2,50,"","",false,2019-01-01,2019-11-22,true

I do not want to infer the schema, since the file may be big. I tried to map to a case class, but for some reason it did not work.

So, I am doing the following:

val dfPG = spark.read.format("csv")
                .option("header", "true")        // treat the first line as a header
                .option("inferSchema", "false")  // leave every column as StringType
                .option("nullValue", "")         // empty fields become null
                .load("/FileStore/tables/SO_QQQ.txt")

and setting the fields explicitly:

val dfPG2 =
  dfPG.map { r =>
    (r.getString(0).toLong, r.getString(1), r.getString(2),
     r.getString(3).toBoolean, r.getString(4).toInt, r.getString(5).toInt,
     r.getString(6)     // r.getString(6).toInt
    )
  }

I cannot seem to handle a null value and also convert the column to Integer. Where there is a null value I get a String, but I want an Int, and I get an error with every approach I try.

See the // comment. The line below fails with a null pointer exception, and for some reason I cannot formulate the check logic there. Is there an easier way?

r.getString(6).toInt

I must be over-complicating and/or missing something.

Just to add: when loading via a Seq to a DataFrame with Option, it all works fine. It's the file input that is the problem.
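
For reference, one null-safe formulation wraps the possibly-null column in Option before converting, so the tuple field becomes Option[Int] rather than Int; a minimal sketch, assuming spark.implicits._ is in scope for the tuple encoder:

import spark.implicits._

val dfPG2 =
  dfPG.map { r =>
    (r.getString(0).toLong, r.getString(1), r.getString(2),
     r.getString(3).toBoolean, r.getString(4).toInt, r.getString(5).toInt,
     // Option(null) is None, so a null column yields None instead of throwing
     Option(r.getString(6)).map(_.toInt))
  }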


1 Answer


That's just not the correct way of doing things. Instead of mapping fields by hand (both inefficient and extremely error-prone), you should define a schema for your data:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField(...),
  StructField("your_integer_field", IntegerType, true),
  ...
))

and provide it to the reader:

val dfPG = spark.read.format("csv")
  .schema(schema)
  ...
  .load("/FileStore/tables/SO_QQQ.txt") 
  • I am going to try this. I have seen this approach in the past and used it on RDDs (I mean my approach). Will get back. It's just the null thing. Excited. – thebluephantom Mar 06 '19 at 19:22
  • Great, but I am wondering if I could have worked around the issue the other way? I think so, but was not sure; maybe not. Anyway, that is great. – thebluephantom Mar 06 '19 at 19:32
  • https://stackoverflow.com/questions/41705602/spark-dataframe-schema-nullable-fields. In Scala all nullable! – thebluephantom Mar 06 '19 at 19:57
  • Seq(...).toDF does show nullable = false / true on the schema, as I always knew BTW, so very interesting. – thebluephantom Mar 06 '19 at 20:09
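
To illustrate that last comment, a minimal sketch (the column names are arbitrary):

import spark.implicits._

// A primitive Int yields nullable = false; Option[Int] and String yield nullable = true.
val memDF = Seq((1, Some(2), "x")).toDF("a", "b", "c")
memDF.printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- b: integer (nullable = true)
//  |-- c: string (nullable = true)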