
:)

I would like to refer to the variable name (just the name, not the value) in a case class.

Here is a very simplified example:

import org.apache.spark.sql.{DataFrame, Encoders}
import org.apache.spark.sql.functions.{col, from_json, trim}

case class Person(name: String, age: Int)
val schema = Encoders.product[Person].schema
val jack = Person("name", 20)

override def method[Person](df: DataFrame): DataFrame = {
  df.withColumn("json", from_json(col("column_value"), schema))
    .select("json.*")
    .withColumn(jack.name, trim(col(jack.name)))
    .withColumn(jack.age, col(jack.age) + 2)
}

Of course, jack.name will return the value, which is a String and works well for my purpose. But as you can already imagine, jack.age will give me the value, not "age".

So far I have this, which I think is a really ugly and inefficient solution:

val onlyNames: Seq[String] = schema.map(_.name) 
...
.withColumn(...)
.withColumn(onlyNames(2), col(onlyNames(2)) + 2)

Versions: Spark 2.3.0 // Scala 2.11.8

Borja
  • Scala 2.13 has `.productElementNames: Iterator[String]`, but since you are on Spark you are virtually required to use Scala 2.11... What kind of solution is acceptable to you? It would be possible to extract names using runtime reflection (though it won't be pretty for generic types), or you could use something like Shapeless or Magnolia to generate the code for you (some people find it easier, some find it harder). Any other library would use either of these under the hood, so which would you prefer? – Mateusz Kubuszok Apr 15 '20 at 14:30
  • Good point, I'm gonna edit the question and add the Scala version. Do you have a solution for both ways? Show both! haha, but I'd rather extract the names using runtime reflection. In the worst scenario, I might just declare an object with String variables for the names and a schema with StructType (roughly the sketch below)... but that is not the spirit of Scala, I think, so any inspiration better than that would be helpful – Borja Apr 15 '20 at 15:11
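
For reference, a minimal sketch of the fallback Borja mentions: hand-written column names kept next to a hand-written schema. The object name here is made up, not from the post.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper object; nothing in the original post defines it.
object PersonColumns {
  val name = "name"
  val age = "age"

  val schema: StructType = StructType(Seq(
    StructField(name, StringType),
    StructField(age, IntegerType)
  ))
}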

2 Answers


In Scala 2.13 you could use:

val person = Person("John", 23)
(person.productElementNames zip person.productIterator).foldLeft(dataFrame) {
  case (dataFrame, (name, value)) =>
    dataFrame.withColumn(name, lit(value)) // example; lit wraps the raw value in a Column
}

but since you are on 2.11, or at best on 2.12 because of Spark, you have to use some other approach.

One way would be to use runtime reflection:

(person.getClass.getDeclaredFields.map(_.getName) zip person.productIterator.toList).foldLeft(dataFrame) {
  case (dataFrame, (name, value)) =>
    dataFrame.withColumn(name, lit(value)) // example
}

This has a runtime penalty but does not require any dependencies.
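
Since the field names never change, one mitigation (my sketch, not part of the answer) is to reflect once and reuse the names in the question's pipeline. Note that getDeclaredFields order is not guaranteed by the JVM spec, although for case classes it matches declaration order in practice; transform below is just an illustrative stand-in for the question's method.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}

// Reflect once, reuse everywhere; index 0 is "name", index 1 is "age".
val fieldNames: Array[String] = classOf[Person].getDeclaredFields.map(_.getName)

// Illustrative stand-in for the question's method.
def transform(df: DataFrame): DataFrame =
  df.withColumn(fieldNames(0), trim(col(fieldNames(0))))
    .withColumn(fieldNames(1), col(fieldNames(1)) + 2)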

Another option would be to use Shapeless or Magnolia to compute the result at compile time (provided that you know, at compile time, the type you want to extract field names from).

A Shapeless solution is already provided in another question.

A Magnolia solution would be something like this (disclaimer: not tested whether it compiles):

import scala.language.experimental.macros
import magnolia._

trait FieldNames[T] {
  def apply(): List[String]
}

object FieldNames {
  def getNames[T](implicit fieldNames: FieldNames[T]): FieldNames[T] = fieldNames

  type Typeclass[T] = FieldNames[T]

  def combine[T](ctx: ReadOnlyCaseClass[Typeclass, T]): FieldNames[T] = () =>
    ctx.parameters.map(_.label).toList

  implicit def gen[T]: FieldNames[T] = macro Magnolia.gen[T]
}

(FieldNames.getNames[Person].apply() zip person.productIterator.toList).foldLeft(dataFrame) {
  case (dataFrame, (name, value)) =>
    dataFrame.withColumn(name, lit(value)) // example
}

Compile-time reflection requires a bit more effort and assumes that you know, at compile time, the type of the value you are working with, but it should be faster at runtime and less error prone.

Long story short, which one is better depends on your use case.

Mateusz Kubuszok

I agree with the previous answer; there are also other alternatives for extracting the fields.

Since you are using Spark, you can always obtain the column names like this:

import spark.implicits._ // assumes a SparkSession named spark is in scope

case class Person(name: String, age: Int)
val jack = Person("name", 20)

val columnNames = Seq(jack).toDF().columns

println(columnNames.toList) // List(name, age)
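
Alternatively, since the question already derives a schema with Encoders, a small sketch that reads the field names straight off that schema, without creating a DataFrame:

import org.apache.spark.sql.Encoders

// Reuse the encoder-derived schema from the question; no DataFrame needed.
val fieldNames: Array[String] = Encoders.product[Person].schema.fieldNames
println(fieldNames.toList) // List(name, age)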

And for pure Scala 2.11:

val fieldNames = classOf[Person].getDeclaredFields.map { f =>
  f.setAccessible(true)
  val res = f.getName
  f.setAccessible(false)
  res
}

println(fieldNames.toList)

Scastie example -> here

And the shapeless example:

import shapeless._
import shapeless.ops.record._

case class Person(name: String, age: Int)
val labelledPerson = LabelledGeneric[Person]
val columnNames = Keys[labelledPerson.Repr].apply.toList.map(_.name)
println(columnNames)

Shapeless and Scala versions:

scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "com.github.alexarchambault" %% "argonaut-shapeless" % "6.1"
)

Scastie example -> here

Bob