
Is there a way to remove the columns of a Spark DataFrame that contain only null values? (I am using Scala and Spark 1.6.2.)

At the moment I am doing this:

var validCols: List[String] = List()
for (col <- df_filtered.columns){
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2){
    validCols ++= List(col)
  }
}

to build the list of columns containing at least two distinct values, and then use it in a select().
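
For reference, once the list is built it can be applied along these lines (a quick sketch; `df_valid` is just an illustrative name):

// keep only the columns that survived the check above (sketch)
val df_valid = df_filtered.select(validCols.head, validCols.tail: _*)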

Thank you!

maxk
  • Possible duplicate of [remove NULL columns in Spark SQL](https://stackoverflow.com/questions/45324762/remove-null-columns-in-spark-sql) – zero323 Oct 15 '18 at 09:43

5 Answers


I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.

for (String column:df.columns()){
    long count = df.select(column).distinct().count();

    if(count == 1 && df.select(column).first().isNullAt(0)){
        df = df.drop(column);
    }
}

I'm dropping all columns that contain exactly one distinct value and whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
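
For reference, a rough Scala equivalent of the same loop might look like this (a sketch only, assuming `df` is the DataFrame in question):

var cleaned = df
for (column <- df.columns) {
  val distinctCount = df.select(column).distinct().count()
  // a single distinct value that is null means the column is entirely null
  if (distinctCount == 1 && df.select(column).first().isNullAt(0)) {
    cleaned = cleaned.drop(column)
  }
}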

Timo Strotmann

Here's a Scala example that removes null columns while querying the data only once (faster):

def removeNullColumns(df:DataFrame): DataFrame = {
    var dfNoNulls = df
    val exprs = df.columns.map((_ -> "count")).toMap
    val cnts = df.agg(exprs).first
    for(c <- df.columns) {
        // count(c) only counts non-null values, so 0 means the column is entirely null
        val uses = cnts.getAs[Long]("count("+c+")")
        if ( uses == 0 ) {
            dfNoNulls = dfNoNulls.drop(c)
        }
    }
    return dfNoNulls
}
swdev
  • Use of `var` and `return`: not idiomatic Scala. – jwvh Mar 26 '19 at 00:51
  • @jwvh The `return` keyword can easily be removed. Avoiding using a `var` would mean using `.select()` instead of `.drop()` since the latter doesn't support arrays. IMHO, neither change makes it any more readable. – swdev Mar 26 '19 at 02:54
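
For illustration, a `select()`-based variant along the lines the comment describes might look like this (an assumed sketch, not taken from the answer; the function name is hypothetical):

import org.apache.spark.sql.DataFrame

def removeNullColumnsViaSelect(df: DataFrame): DataFrame = {
  // count(col) ignores nulls, so a count of 0 marks an all-null column
  val cnts = df.agg(df.columns.map(_ -> "count").toMap).first
  val keep = df.columns.filter(c => cnts.getAs[Long](s"count($c)") > 0)
  df.select(keep.head, keep.tail: _*)
}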

A more idiomatic version of @swdev's answer:

private def removeNullColumns(df:DataFrame): DataFrame = {
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count("+c+")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}
ItamarBe

If the DataFrame is of a reasonable size, I write it as JSON and then reload it. Spark's JSON writer omits null fields, so the schema inferred on reload will not include the all-null columns, and you end up with a lighter DataFrame.

Scala snippet:

originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)
mjalajel

Here's @timo-strotmann's solution in PySpark syntax:

for column in df.columns:
    count = df.select(column).distinct().count()
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)
Ronen