
I am new to Scala and Spark.

I am trying to use an encoder to read a file in Spark and then convert it to a Java/Scala object.

The first step, reading the file with a schema and encoding it with `as`, works fine.

Then I use that Dataset/DataFrame to do a simple `map` operation, but when I print the schema of the resulting Dataset/DataFrame it doesn't show any columns.

Also, when I first read the file I deliberately don't map the `age` field of the `Person` class, so that I can calculate it in the `map` function instead - but I don't see `age` mapped onto the DataFrame of `Person` at all.

Data in Person.txt:

firstName,lastName,dob
ABC, XYZ, 01/01/2019
CDE, FGH, 01/02/2020

Below is the code:

import java.time.Year

import org.apache.spark.sql.{Encoders, SparkSession}

object EncoderExample extends App {
  val sparkSession = SparkSession.builder().appName("EncoderExample").master("local").getOrCreate()

  case class Person(firstName: String, lastName: String, dob: String,var age: Int = 10)
  implicit val encoder = Encoders.bean[Person](classOf[Person])
  val personDf = sparkSession.read.option("header","true").option("inferSchema","true").csv("Person.txt").as(encoder)

  personDf.printSchema()
  personDf.show()

  val calAge = personDf.map(p => {
    p.age = Year.now().getValue - p.dob.substring(6).toInt
    println(p.age)
    p
  } )//.toDF()//.as(encoder)

  print("*********Person DF Schema after age calculation: ")
  calAge.printSchema()

  //calAge.show
}
Sanjeev
  • A **case class** is not a Java bean. You only need to do this: `import sparkSession.implicits._` and then `sparkSession.read.option("header","true").option("inferSchema","true").csv("Person.txt").as[Person]`; that is explained in the [getting started page of the documentation](https://spark.apache.org/docs/latest/sql-getting-started.html#creating-datasets) - Also, a `print` inside a `map` is discouraged and will not work as expected on a real distributed deployment - Finally, **case classes** should be `final` - it would be good to take some time to read and learn a little bit more before diving in. – Luis Miguel Mejía Suárez Aug 18 '20 at 05:17
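
Following that comment, here is a minimal sketch of the product-encoder approach (the `EncoderExampleFixed` object and `PersonRaw`/`PersonWithAge` class names are illustrative, and it assumes `dob` always ends in a four-digit year):

import java.time.Year

import org.apache.spark.sql.SparkSession

// Defined outside the App object so Spark can derive product encoders for them.
final case class PersonRaw(firstName: String, lastName: String, dob: String)
final case class PersonWithAge(firstName: String, lastName: String, dob: String, age: Int)

object EncoderExampleFixed extends App {

  val sparkSession = SparkSession.builder()
    .appName("EncoderExample")
    .master("local")
    .getOrCreate()

  import sparkSession.implicits._ // brings the product encoders into scope

  val rawDs = sparkSession.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Person.txt")
    .as[PersonRaw]

  // map returns a Dataset[PersonWithAge], so printSchema shows all four columns.
  val withAge = rawDs.map(p =>
    PersonWithAge(p.firstName, p.lastName, p.dob,
      Year.now().getValue - p.dob.split("/").last.trim.toInt))

  withAge.printSchema()
  withAge.show()
}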

1 Answer

package spark

import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Person(firstName: String, lastName: String, dob: String, age: Long)

object CalcAge extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val sourceDF = Seq(
    ("ABC", "XYZ", "01/01/2019"),
    ("CDE", "FGH", "01/02/2020")
  ).toDF("firstName","lastName","dob")

  sourceDF.printSchema
  //  root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)

  sourceDF.show(false)
  //  +---------+--------+----------+
  //  |firstName|lastName|dob       |
  //  +---------+--------+----------+
  //  |ABC      |XYZ     |01/01/2019|
  //  |CDE      |FGH     |01/02/2020|
  //  +---------+--------+----------+


  def getCurrentYear: Long = {

    val today:java.util.Date = Calendar.getInstance.getTime
    val timeFormat = new SimpleDateFormat("yyyy")
    timeFormat.format(today).toLong

  }

  val ageUDF = udf((d1: String) => {

    val year = d1.split("/").reverse.head.toLong
    val yearNow = getCurrentYear
    yearNow - year
  })


  val df = sourceDF
    .withColumn("age", ageUDF('dob))
  df.printSchema
  //  root
  //  |-- firstName: string (nullable = true)
  //  |-- lastName: string (nullable = true)
  //  |-- dob: string (nullable = true)
  //  |-- age: long (nullable = false)

  df.show(false)
  //  +---------+--------+----------+---+
  //  |firstName|lastName|dob       |age|
  //  +---------+--------+----------+---+
  //  |ABC      |XYZ     |01/01/2019|1  |
  //  |CDE      |FGH     |01/02/2020|0  |
  //  +---------+--------+----------+---+

  val person = df.as[Person].collectAsList()
  //  person: java.util.List[Person] = [Person(ABC,XYZ,01/01/2019,1), Person(CDE,FGH,01/02/2020,0)]
  println(person)
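
  // A hedged alternative (not part of the original answer): the age column can also
  // be derived with built-in column functions instead of a UDF, assuming dob always
  // ends in a four-digit year. Staying with built-ins keeps the expression visible
  // to Catalyst, which cannot look inside a UDF.
  val dfNoUdf = sourceDF.withColumn(
    "age",
    lit(getCurrentYear) - split(col("dob"), "/").getItem(2).cast("long")
  )
  dfNoUdf.show(false)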



}
mvasyliv