
I want a function that dynamically selects Spark DataFrame columns by their datatype.

So far, I have created:

import org.apache.spark.sql.types.{DataType, StructType}

object StructTypeHelpers {
  def selectColumnsByType[T <: DataType](schem: StructType): Seq[String] = {
    schem.filter(_.dataType.isInstanceOf[T]).map(_.name)
  }
}

so that StructTypeHelpers.selectColumnsByType[StringType](df.schema) works. However, the compiler warns me that:

abstract type T is unchecked since it is eliminated by erasure

When trying to use:

import scala.reflect.ClassTag
def selectColumnsByType[T <: DataType: ClassTag](schem: StructType):Seq[String]

it fails with

No ClassTag available for T

How can I get it to work and compile without the warning?

Georg Heiler
  • The answer should be obvious: just add a TypeTag or ClassTag (see https://docs.scala-lang.org/overviews/reflection/typetags-manifests.html), as you are not providing any information about type T in the method. – Pavel Feb 20 '19 at 08:18
  • But shouldn't `T <: DataType : ClassTag` be exactly that? It failed with `No ClassTag available for T`. – Georg Heiler Feb 20 '19 at 08:21
  • You have to provide the TypeTag / ClassTag info implicitly; it's a bit of a pain, but it works nicely. – Pavel Feb 20 '19 at 08:24
  • You mean https://stackoverflow.com/questions/18136313/abstract-type-pattern-is-unchecked-since-it-is-eliminated-by-erasure/18136667 ? It still does not really seem to work yet. – Georg Heiler Feb 20 '19 at 08:26
  • No, that's not what I mean; something like def paramInfo[T](x: T)(implicit tag: TypeTag[T]), as per my first link. It really should work. Sorry, I should have put this as the answer :) – Pavel Feb 20 '19 at 08:29
  • `def selectColumnsByType[T <: DataType](schem: StructType)(implicit tag: TypeTag[T]):Seq[String] = { schem.filter(_.dataType.isInstanceOf[T]).map(_.name) }` would incorporate your advice, but still yields the same warning. – Georg Heiler Feb 20 '19 at 08:32
  • Try something like this (I haven't had a chance to check whether it works, but you can play with the type info at run time): object StructTypeHelpers { def selectColumnsByType[T <: DataType](schem: StructType)(implicit tag: TypeTag[T]): Seq[String] = { schem.filter(_.dataType.typeName == tag.tpe.typeSymbol.fullName).map(_.name) } } – Pavel Feb 20 '19 at 09:07
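
A minimal compiling sketch of the TypeTag idea from the comments above, assuming the intent is to compare each field's DataType against T's runtime class rather than against a type-name string:

import org.apache.spark.sql.types.{DataType, StructType}
import scala.reflect.runtime.universe.{TypeTag, typeTag}

object StructTypeHelpers {
  // Resolve T's runtime class from its TypeTag and test each field's
  // DataType against it; there is no isInstanceOf[T], so no unchecked-erasure warning.
  def selectColumnsByType[T <: DataType : TypeTag](schem: StructType): Seq[String] = {
    val tag = typeTag[T]
    val runtimeClass = tag.mirror.runtimeClass(tag.tpe)
    schem.filter(field => runtimeClass.isInstance(field.dataType)).map(_.name)
  }
}

// usage, e.g.: StructTypeHelpers.selectColumnsByType[StringType](df.schema)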

2 Answers


The idea is to filter the schema for the columns that have the type you want and then select them.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}
import spark.implicits._ // needed for toDF outside spark-shell (assumes a SparkSession named spark)

val df = Seq(
  (1, 2, "hello")
).toDF("id", "count", "name")

def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
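
With the example DataFrame above, only the two IntegerType columns should remain after the selection; a quick sanity check (assuming the snippet has been run in a spark-shell session):

// Only the IntegerType columns "id" and "count" should survive the selection.
res.columns.toSeq    // should contain only "id" and "count"
res.printSchema()    // both remaining fields are reported as integer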
firsni

A literal answer, helped by "How to know if an object is an instance of a TypeTag's type?", would be this:

var x = spark.table(...)

import org.apache.spark.sql.types._
import scala.reflect.{ClassTag, classTag}
def selectColumnsByType[T <: DataType : ClassTag](schema: StructType):Seq[String] = {
  schema.filter(field => classTag[T].runtimeClass.isInstance(field.dataType)).map(_.name)
}

selectColumnsByType[DecimalType](x.schema)

However, this form definitely makes it easier to use:

var x = spark.table(...)

import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import scala.reflect.{ClassTag, classTag}
class DataFrameHelpers(val df: DataFrame) {
  def selectColumnsByType[T <: DataType : ClassTag](): DataFrame = {
    val cols = df.schema.filter(field => classTag[T].runtimeClass.isInstance(field.dataType)).map(field => col(field.name))
    df.select(cols:_*)
  }    
}

implicit def toDataFrameHelpers(df: DataFrame): DataFrameHelpers = new DataFrameHelpers(df)

x = x.selectColumnsByType[DecimalType]()

Note, though, as an earlier answer mentioned, that isInstanceOf isn't really appropriate here, although it is helpful if you want to get all DecimalType columns regardless of precision. Using the more usual equality comparison, you could do the following instead, which also lets you specify multiple types:

var x = spark.table(...)

import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
class DataFrameHelpers(val df: DataFrame) {
  def selectColumnsByType(dt: DataType*): DataFrame = {
    val cols = df.schema.filter(field => dt.exists(_ == field.dataType)).map(field => col(field.name))
    df.select(cols:_*)
  }    
}

implicit def toDataFrameHelpers(df: DataFrame): DataFrameHelpers = new DataFrameHelpers(df)

x = x.selectColumnsByType(ShortType, DecimalType(38,18))
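
The same idea could also be packaged as an implicit class (a sketch, assuming Scala 2.10+; the DataFrameSyntax and RichDataFrame names are made up for illustration), which provides the extension method without the separate implicit def conversion:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

object DataFrameSyntax {
  // Same selection logic as above: keep only the fields whose DataType
  // equals one of the requested types, then select those columns.
  implicit class RichDataFrame(df: DataFrame) {
    def selectColumnsByType(dt: DataType*): DataFrame = {
      val cols = df.schema.filter(field => dt.contains(field.dataType)).map(f => col(f.name))
      df.select(cols: _*)
    }
  }
}

// usage: import DataFrameSyntax._
//        x.selectColumnsByType(ShortType, DecimalType(38, 18))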
wilbur4321