
The problem arises when I call the describe function on a DataFrame:

val statsDF = myDataFrame.describe()

Calling the describe function yields the following output:

statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]

I can show statsDF normally by calling statsDF.show():

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+

I would now like to get the standard deviation and the mean from statsDF, but when I try to collect the values with something like:

val temp = statsDF.where($"summary" === "stddev").collect()

I am getting a Task not serializable exception.

I am also facing the same exception when I call:

statsDF.where($"summary" === "stddev").show()

It looks like we cannot filter DataFrames generated by the describe() function. Is that the case?

Rami

4 Answers


Consider a toy dataset I had containing some health disease data:

import org.apache.spark.sql.Row

// Keep (summary name, value of column 1), then pick out the stddev row.
val stddev_tobacco = rawData.describe().rdd.map {
  case r: Row => (r.getAs[String]("summary"), r.get(1))
}.filter(_._1 == "stddev").map(_._2).collect()
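If you need every described column rather than just column 1 (as asked in the comments below), a sketch along the same lines, still assuming the rawData DataFrame from above, is to collect the whole output into a map keyed by the summary name:

import org.apache.spark.sql.Row

// Map each summary name ("count", "mean", "stddev", ...) to the values
// of all described columns; describe() returns every column as a string.
val summaryMap: Map[String, Seq[String]] =
  rawData.describe().rdd.map {
    case Row(summary: String, rest @ _*) =>
      summary -> rest.map(_.asInstanceOf[String])
  }.collect().toMap

summaryMap("stddev")  // one standard deviation per described column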
eliasah
  • @Rami This should do the job, though it's a bit silly and naïve... Tell me if this works for you! – eliasah Feb 08 '16 at 14:53
  • Thanks @eliasah, it is strange that we can't filter these DataFrames. I will consider pointing this problem out to the Spark folks. – Rami Feb 08 '16 at 15:45
  • @zero323 What do you think about this issue? Should we open an issue on JIRA about it? – eliasah Feb 08 '16 at 15:45
  • @Rami It is actually strange to me! That's why I'm asking the big chef about it :) – eliasah Feb 08 '16 at 15:46
  • The big chef is great ;) – Rami Feb 08 '16 at 15:48
  • This is helpful. Can you suggest what modification is required to make this work for the entire row (when I have multiple numeric columns in the original dataset)? – Omley Jun 17 '16 at 10:06

You can select the statistics directly from the DataFrame:

from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
|      AVG(uniform)|       MIN(uniform)|      MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
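For the Scala side, a rough equivalent of the same select (the df and uniform names are this answer's example data, assumed here):

import org.apache.spark.sql.functions.{mean, min, max}

// Aggregate directly on the DataFrame; no describe() involved.
df.select(mean("uniform"), min("uniform"), max("uniform")).show()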

You can also register it as a temporary table and query it with SQL:

val t = x.describe()
t.registerTempTable("dt")

%sql 
select * from dt
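The %sql cell above is notebook magic (e.g. Zeppelin); outside a notebook the same query can be run from Scala. A minimal sketch, assuming the Spark 1.x-style sqlContext that matches registerTempTable:

// Query the temp table registered above; all describe() columns are strings.
sqlContext.sql("SELECT * FROM dt").show()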
oluies

Another option is to use selectExpr(), which also runs optimized, e.g. to obtain the min:

myDataFrame.selectExpr('MIN(count)').head()[0]
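The snippet above is PySpark; a Scala sketch of the same idea, pulling both mean and stddev in one pass (assumes Spark 1.6+ for the stddev SQL function; backticks because count is also a function name):

import org.apache.spark.sql.Row

// Destructure the single aggregated row into two doubles.
val Row(mean: Double, sd: Double) =
  myDataFrame.selectExpr("avg(`count`)", "stddev(`count`)").head()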
Boern
myDataFrame.describe().filter($"summary"==="stddev").show()

This worked quite nicely on Spark 2.3.0.
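If you need the value itself rather than a printed table, one more hedged step on the same line (describe() returns everything as strings, hence the toDouble; column 1 is the single described column from the question):

// Take the stddev row and parse its value; $-syntax assumes the same
// implicits as the snippet above.
val sd = myDataFrame.describe()
  .filter($"summary" === "stddev")
  .head()
  .getString(1)
  .toDouble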

kvk