
I want a function that dynamically selects Spark DataFrame columns by their datatype.

So far, I have created:

import org.apache.spark.sql.types.{DataType, StructType}

object StructTypeHelpers {
  def selectColumnsByType[T <: DataType](schem: StructType): Seq[String] = {
    schem.filter(_.dataType.isInstanceOf[T]).map(_.name)
  }
}

so that StructTypeHelpers.selectColumnsByType[StringType](df.schema) works. However, the compiler warns me that:

abstract type T is unchecked since it is eliminated by erasure

When trying to use:

import scala.reflect.ClassTag
def selectColumnsByType[T <: DataType: ClassTag](schem: StructType):Seq[String]

it fails with

No ClassTag available for T

How can I get it to work and compile without the warning?

Georg Heiler
  • The answer should be obvious: just add a TypeTag or ClassTag (see https://docs.scala-lang.org/overviews/reflection/typetags-manifests.html), as you are not providing any information about type T in the method. – Pavel Feb 20 '19 at 08:18
  • But shouldn't `T <: DataType : ClassTag` be exactly that? It failed with `No ClassTag available for T`. – Georg Heiler Feb 20 '19 at 08:21
  • You have to provide the TypeTag / ClassTag info implicitly; it's a bit of a pain, but it works nicely. – Pavel Feb 20 '19 at 08:24
  • You mean https://stackoverflow.com/questions/18136313/abstract-type-pattern-is-unchecked-since-it-is-eliminated-by-erasure/18136667 ? It still does not really seem to work yet. – Georg Heiler Feb 20 '19 at 08:26
  • No, that's not what I mean; something like def paramInfo[T](x: T)(implicit tag: TypeTag[T]), as per my first link. It really should work. Sorry, I should have put this as the answer :) – Pavel Feb 20 '19 at 08:29
  • `def selectColumnsByType[T <: DataType](schem: StructType)(implicit tag: TypeTag[T]):Seq[String] = { schem.filter(_.dataType.isInstanceOf[T]).map(_.name) }` would incorporate your advice, but still yields the same warning. – Georg Heiler Feb 20 '19 at 08:32
  • Try something like this (I haven't had a chance to check whether it works, but you can play with the type info at run time): object StructTypeHelpers { def selectColumnsByType[T <: DataType](schem: StructType)(implicit tag: TypeTag[T]): Seq[String] = { schem.filter(_.dataType.typeName == tag.tpe.typeSymbol.fullName).map(_.name) } } – Pavel Feb 20 '19 at 09:07
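
A minimal compiling sketch of the TypeTag idea from the comments above, assuming the intent is to compare each field's DataType against T's runtime class rather than against a type-name string:

import org.apache.spark.sql.types.{DataType, StructType}
import scala.reflect.runtime.universe.{TypeTag, typeTag}

object StructTypeHelpers {
  // Resolve T's runtime class from its TypeTag and test each field's
  // DataType against it; there is no isInstanceOf[T], so no unchecked-erasure warning.
  def selectColumnsByType[T <: DataType : TypeTag](schem: StructType): Seq[String] = {
    val tag = typeTag[T]
    val runtimeClass = tag.mirror.runtimeClass(tag.tpe)
    schem.filter(field => runtimeClass.isInstance(field.dataType)).map(_.name)
  }
}

// usage, e.g.: StructTypeHelpers.selectColumnsByType[StringType](df.schema)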

2 Answers


The idea is to filter the schema for the columns that have the type you want and then select them.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}
import spark.implicits._ // needed for toDF outside spark-shell (assumes a SparkSession named spark)

val df = Seq(
  (1, 2, "hello")
).toDF("id", "count", "name")

def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
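
With the example DataFrame above, only the two IntegerType columns should remain after the selection; a quick sanity check (assuming the snippet has been run in a spark-shell session):

// Only the IntegerType columns "id" and "count" should survive the selection.
res.columns.toSeq    // should contain only "id" and "count"
res.printSchema()    // both remaining fields are reported as integer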
firsni

A literal answer, helped by "How to know if an object is an instance of a TypeTag's type?", would be this:

var x = spark.table(...)

import org.apache.spark.sql.types._
import scala.reflect.{ClassTag, classTag}
def selectColumnsByType[T <: DataType : ClassTag](schema: StructType):Seq[String] = {
  schema.filter(field => classTag[T].runtimeClass.isInstance(field.dataType)).map(_.name)
}

selectColumnsByType[DecimalType](x.schema)

However, this form definitely makes it easier to use:

var x = spark.table(...)

import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import scala.reflect.{ClassTag, classTag}
class DataFrameHelpers(val df: DataFrame) {
  def selectColumnsByType[T <: DataType : ClassTag](): DataFrame = {
    val cols = df.schema.filter(field => classTag[T].runtimeClass.isInstance(field.dataType)).map(field => col(field.name))
    df.select(cols:_*)
  }    
}

implicit def toDataFrameHelpers(df: DataFrame): DataFrameHelpers = new DataFrameHelpers(df)

x = x.selectColumnsByType[DecimalType]()

Note, though, as an earlier answer mentioned, that isInstanceOf isn't really appropriate here, although it is helpful if you want to get all DecimalType columns regardless of precision. Using the more usual equality comparison, you could do the following instead, which also lets you specify multiple types:

var x = spark.table(...)

import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
class DataFrameHelpers(val df: DataFrame) {
  def selectColumnsByType(dt: DataType*): DataFrame = {
    val cols = df.schema.filter(field => dt.exists(_ == field.dataType)).map(field => col(field.name))
    df.select(cols:_*)
  }    
}

implicit def toDataFrameHelpers(df: DataFrame): DataFrameHelpers = new DataFrameHelpers(df)

x = x.selectColumnsByType(ShortType, DecimalType(38,18))
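
The same idea could also be packaged as an implicit class (a sketch, assuming Scala 2.10+; the DataFrameSyntax and RichDataFrame names are made up for illustration), which provides the extension method without the separate implicit def conversion:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

object DataFrameSyntax {
  // Same selection logic as above: keep only the fields whose DataType
  // equals one of the requested types, then select those columns.
  implicit class RichDataFrame(df: DataFrame) {
    def selectColumnsByType(dt: DataType*): DataFrame = {
      val cols = df.schema.filter(field => dt.contains(field.dataType)).map(f => col(f.name))
      df.select(cols: _*)
    }
  }
}

// usage: import DataFrameSyntax._
//        x.selectColumnsByType(ShortType, DecimalType(38, 18))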
wilbur4321