
Is there a way to remove the columns of a Spark DataFrame that contain only null values? (I am using Scala and Spark 1.6.2.)

At the moment I am doing this:

var validCols: List[String] = List()
for (col <- df_filtered.columns){
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2){
    validCols ++= List(col)
  }
}

to build the list of columns containing at least two distinct values, and then use it in a select().
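
For reference, once the list is built it can be applied along these lines (a quick sketch; `df_valid` is just an illustrative name):

// keep only the columns that survived the check above (sketch)
val df_valid = df_filtered.select(validCols.head, validCols.tail: _*)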

Thank you!

maxk
  • Possible duplicate of [remove NULL columns in Spark SQL](https://stackoverflow.com/questions/45324762/remove-null-columns-in-spark-sql) – zero323 Oct 15 '18 at 09:43

5 Answers


I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.

for (String column:df.columns()){
    long count = df.select(column).distinct().count();

    if(count == 1 && df.select(column).first().isNullAt(0)){
        df = df.drop(column);
    }
}

I'm dropping all columns that contain exactly one distinct value and whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
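
For reference, a rough Scala equivalent of the same loop might look like this (a sketch only, assuming `df` is the DataFrame in question):

var cleaned = df
for (column <- df.columns) {
  val distinctCount = df.select(column).distinct().count()
  // a single distinct value that is null means the column is entirely null
  if (distinctCount == 1 && df.select(column).first().isNullAt(0)) {
    cleaned = cleaned.drop(column)
  }
}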

Timo Strotmann

Here's a Scala example that removes null columns while querying the data only once (faster):

def removeNullColumns(df:DataFrame): DataFrame = {
    var dfNoNulls = df
    val exprs = df.columns.map((_ -> "count")).toMap
    val cnts = df.agg(exprs).first
    for(c <- df.columns) {
        // count(c) only counts non-null values, so 0 means the column is entirely null
        val uses = cnts.getAs[Long]("count("+c+")")
        if ( uses == 0 ) {
            dfNoNulls = dfNoNulls.drop(c)
        }
    }
    return dfNoNulls
}
swdev
  • Use of `var` and `return`: not idiomatic Scala. – jwvh Mar 26 '19 at 00:51
  • @jwvh The `return` keyword can easily be removed. Avoiding using a `var` would mean using `.select()` instead of `.drop()` since the latter doesn't support arrays. IMHO, neither change makes it any more readable. – swdev Mar 26 '19 at 02:54
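
For illustration, a `select()`-based variant along the lines the comment describes might look like this (an assumed sketch, not taken from the answer; the function name is hypothetical):

import org.apache.spark.sql.DataFrame

def removeNullColumnsViaSelect(df: DataFrame): DataFrame = {
  // count(col) ignores nulls, so a count of 0 marks an all-null column
  val cnts = df.agg(df.columns.map(_ -> "count").toMap).first
  val keep = df.columns.filter(c => cnts.getAs[Long](s"count($c)") > 0)
  df.select(keep.head, keep.tail: _*)
}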

A more idiomatic version of @swdev's answer:

private def removeNullColumns(df:DataFrame): DataFrame = {
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count("+c+")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}
ItamarBe

If the DataFrame is of a reasonable size, I write it as JSON and then reload it. Spark's JSON writer omits null fields, so the schema inferred on reload will not include the all-null columns, and you end up with a lighter DataFrame.

Scala snippet:

originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)
mjalajel

Here's @timo-strotmann's solution in PySpark syntax:

for column in df.columns:
    count = df.select(column).distinct().count()
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)
Ronen