
I have a Spark DataFrame for which I want to get summary statistics:

stats_df = df.describe(['mycol'])
stats_df.show()
+-------+------------------+
|summary|             mycol|
+-------+------------------+
|  count|               300|
|   mean|              2243|
| stddev|  319.419860456123|
|    min|              1400|
|    max|              3100|
+-------+------------------+

How do I extract the min and max values of mycol by looking up the 'min' and 'max' labels in the summary column? And how do I do the same by numeric index?

zero323
oikonomiyaki
  • I've already answered this question [here](http://stackoverflow.com/questions/35272086/spark-1-6-filtering-dataframes-generated-by-describe) – eliasah Jul 27 '16 at 06:28
  • @eliasah Can you please provide a Python version of the solution here? I have difficulty translating your solution, as I don't know Scala. – oikonomiyaki Jul 27 '16 at 06:37
  • OK, I've written an answer below! If it solves your problem, please accept and upvote :) – eliasah Jul 27 '16 at 06:56

2 Answers


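You can assign a variable from a filtered select on that DataFrame. Note that describe() returns every value as a string, so cast the result if you need a number.

# Compare the column itself: 'summary' == 'min' compares two Python strings and is always False.
x = stats_df.where(stats_df['summary'] == 'min').select('mycol').first()['mycol']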
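To make the lookup-by-label idea concrete without a running cluster, here is a minimal sketch in plain Python. The `collected` list is a stand-in for `stats_df.collect()`, using the values from the question; `value_for` is a hypothetical helper, not part of PySpark. It assumes each collected row behaves like a `(summary, mycol)` tuple of strings, which is how PySpark `Row` objects with two fields can be unpacked.

```python
# Stand-in for stats_df.collect(): each row is (summary, mycol), both strings.
# Real pyspark Row objects are tuples, so the same unpacking applies to them.
collected = [
    ("count", "300"),
    ("mean", "2243"),
    ("stddev", "319.419860456123"),
    ("min", "1400"),
    ("max", "3100"),
]


def value_for(rows, label):
    """Return the value of the first row whose summary equals `label`, as a float.

    Hypothetical helper for illustration only.
    """
    for summary, value in rows:
        if summary == label:
            return float(value)  # describe() yields strings, so cast explicitly
    raise KeyError(label)


min_val = value_for(collected, "min")
max_val = value_for(collected, "max")
print(min_val, max_val)  # 1400.0 3100.0
```

With a real DataFrame, the same helper should work on `stats_df.collect()` directly, since the summary table has only a handful of rows and is cheap to bring to the driver.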
BMac

OK, let's consider the following example:

from pyspark.sql.functions import rand, randn
df = sqlContext.range(1, 1000).toDF('mycol')
df.describe().show()
# +-------+-----------------+
# |summary|            mycol|
# +-------+-----------------+
# |  count|              999|
# |   mean|            500.0|
# | stddev|288.5307609250702|
# |    min|                1|
# |    max|              999|
# +-------+-----------------+

If you want to access the stddev row, for example, convert the describe() DataFrame to an RDD, collect it, and turn the result into a dictionary as follows:

# Go through .rdd explicitly: DataFrame.map was removed in Spark 2.0.
stats = dict(df.describe().rdd.map(lambda r: (r.summary, r.mycol)).collect())
print(stats['stddev'])
# 288.5307609250702
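The question also asks about access by numeric index. A minimal sketch of that, again in plain Python: `collected` stands in for `df.describe().collect()` with the values from the example above, and the row positions are an assumption based on the order shown in the output (count, mean, stddev, min, max), so the sketch checks the labels before trusting the positions.

```python
# Stand-in for df.describe().collect(): rows of (summary, mycol), both strings.
collected = [
    ("count", "999"),
    ("mean", "500.0"),
    ("stddev", "288.5307609250702"),
    ("min", "1"),
    ("max", "999"),
]

# Assumed positions, taken from the row order in the output shown above.
MIN_ROW, MAX_ROW = 3, 4
VALUE_COL = 1  # column 0 is the summary label, column 1 is mycol

# Guard against a different row order before relying on the positions.
assert collected[MIN_ROW][0] == "min"
assert collected[MAX_ROW][0] == "max"

min_val = float(collected[MIN_ROW][VALUE_COL])
max_val = float(collected[MAX_ROW][VALUE_COL])
print(min_val, max_val)  # 1.0 999.0
```

Positional access is shorter, but the label-keyed dictionary is less fragile if the set of summary rows ever changes.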
eliasah
  • Useful map but you'll need to convert the describe dataframe to rdd: dict(df.describe().rdd.map(lambda r: (r.summary, r.High)).collect()) – Rony Armon Dec 26 '22 at 08:07