Frequency Distribution of a Column

Asked Jul 04 '16 at 21:40

Active Dec 21 '16 at 08:41

Viewed 1,165 times

Given a column like so:

+----------+
| firstName|
+----------+
| bob      |
| jack     |
| bob      |
+----------+

I want to output the frequency distribution in the following format: bob -> 2. I'm trying to create a UDF in Scala with Spark DataFrames. The end goal is that, given a DataFrame, I can output the frequency distribution of each of its columns.

def freqDist(col: Column): Column = {
 //need help with this part
}

// iterate through each column (name taken from df.dtypes) and compute its freqDist
val freqDists = df.dtypes.map(ct => freqDist(col(ct._1))).toList

New Approach:

val freqDist = df.dtypes.map(ct => {df.select(ct._1).groupBy(col(ct._1)).count()})

This gives me the frequency distribution of each column, but the type of freqDist is now an Array of DataFrames. How can I combine all of them into a single DataFrame?
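One possible way to combine them, sketched here under some assumptions: give every per-column result the same schema and union the pieces into one long-format DataFrame. The sketch assumes Spark 2.0 or later (`SparkSession`, `Dataset.union`; on Spark 1.x the equivalent call is `unionAll`). The names `FreqDistSketch`, `freqDistAll`, and the output columns `column`, `value`, and `count` are made up for illustration. Values are cast to strings so that columns of different types can share one schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object FreqDistSketch {
  // Hypothetical helper: returns one DataFrame with the columns
  // (column, value, count), holding one row per distinct value of each
  // input column. Assumes df has at least one column (reduce fails on an
  // empty array) and that column names contain no dots.
  def freqDistAll(df: DataFrame): DataFrame =
    df.columns
      .map { name =>
        df.groupBy(col(name))
          .count()
          .select(
            lit(name).as("column"),               // which input column the row describes
            col(name).cast("string").as("value"), // cast so all pieces share one schema
            col("count")                          // number of occurrences of that value
          )
      }
      .reduce(_ union _) // stack the per-column results by position

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("freq-dist-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same data as the example column in the question.
    val df = Seq("bob", "jack", "bob").toDF("firstName")
    freqDistAll(df).show()

    spark.stop()
  }
}
```

For the example column this should produce one row for bob with a count of 2 and one for jack with a count of 1, in no guaranteed order. A long format is used rather than a wide one because each column can have a different number of distinct values.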

edited Dec 21 '16 at 08:41

Gia Duong Duc Minh


asked Jul 04 '16 at 21:40

jojo


StackOverflow is not for other people to write your code for you. Show the code that you have written so far, give an exact definition of the problem with your code, and then ask a concrete question with a limited scope. – Madoc Jul 04 '16 at 22:45
Possible duplicate of [How to get Histogram of all columns in a large CSV / RDD\[Array\[double\]\] using Apache Spark Scala?](http://stackoverflow.com/questions/33251427/how-to-get-histogram-of-all-columns-in-a-large-csv-rddarraydouble-using-ap) – marios Jul 05 '16 at 00:36
