Finding Median on Column in Dataframe

Asked May 24 '17 at 06:26

Active Jun 30 '18 at 05:53

Viewed 1,020 times

I have a DataFrame with a column named salary. I need to find the median of this column using Spark SQL and Scala. The Spark version is 1.6.0 and the Scala version is 2.10.5.

I have registered the DataFrame as a table and ran the query below.

import org.apache.spark.mllib.random.RandomRDDs

sqlContext.sql("SELECT percentile_approx(salary, 0.5) FROM employee").show()

The DataFrame is created from a CSV file that has a header row plus data rows. The number of data rows is odd. When I run the above query, it gives me a result with a decimal value.

The data looks like this (from the CSV):

salary; name; job;     gender
1000;   AA;   private; M
2000;   BB;   public;  M

Please help me to find the correct solution for this. Thanks in advance.

edited Jun 30 '18 at 05:53

tourist


asked May 24 '17 at 06:26

codelover

Did you try this? https://stackoverflow.com/questions/34519549/how-to-calculate-median-in-spark-sqlcontext-for-column-of-data-type-double – Shankar May 24 '17 at 07:32
What is the count in the DataFrame if you skip the header? – koiralo May 24 '17 at 07:50
Please share example data. – mtoto May 24 '17 at 08:08
The count is odd after skipping the header. Please check the data above. – codelover May 24 '17 at 18:53
The issue is that the median comes out as a decimal value even though the count is odd and no column values are decimals. – codelover May 24 '17 at 19:00
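
For reference, percentile_approx is an approximation by design and returns a floating-point value, so a decimal result is possible even with an odd row count. Below is a minimal sketch of what an exact median would look like in plain Scala, assuming the salary values are integers and small enough to collect onto the driver. The names MedianSketch and exactMedian are hypothetical and not part of Spark.

```scala
object MedianSketch {
  // Exact median of integer values (hypothetical helper, not a Spark API).
  // Odd count: the middle element. Even count: the mean of the two middle elements.
  // Returns None for an empty input.
  def exactMedian(values: Seq[Long]): Option[Double] = {
    if (values.isEmpty) None
    else {
      val sorted = values.toVector.sorted
      val n = sorted.length
      if (n % 2 == 1) Some(sorted(n / 2).toDouble)
      else Some((sorted(n / 2 - 1) + sorted(n / 2)) / 2.0)
    }
  }
}
```

For example, MedianSketch.exactMedian(Seq(3000L, 1000L, 2000L)) gives Some(2000.0). Hive also provides an exact percentile(col, p) aggregate for integral columns, which may be usable from the same query if the sqlContext is a HiveContext; that is an assumption about the setup here.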


0 Answers