Given a column like so:

+----------+
| firstName|
+----------+
| bob      |
| jack     |
| bob      |
+----------+

I want to output the frequency distribution of its values in the following format: bob -> 2. I'm trying to write a UDF for this in Scala with Spark DataFrames. The end goal is that, given a DataFrame, I can output the frequency distribution of every column in it.

def freqDist(col: Column): Column = {
 //need help with this part
}

val freqDists = df.dtypes.map(ct => freqDist(col(ct._1))).toList
//iterate through each column name (df.dtypes yields (name, type) pairs) and compute the freqDist like so

New Approach:

val freqDist = df.dtypes.map(ct => {df.select(ct._1).groupBy(col(ct._1)).count()})

This gives me the frequency distribution of each column, but freqDist is now an Array of DataFrames (Array[DataFrame]). How can I combine all of them into a single DataFrame?
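
A minimal sketch of one way the per-column results could be combined, offered as an illustration rather than a definitive answer. It assumes Spark 2.0 or later (where `union` is available; on 1.x the equivalent call is `unionAll`), a DataFrame with at least one column, and plain column names without dots. The names `FreqDistSketch`, `freqDistAll`, `column`, and `value` are made up for this example. Each column's values are cast to string so that counts from columns of different types fit one schema, and the results are stacked into a long-format DataFrame.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object FreqDistSketch {
  // Hypothetical helper: returns one DataFrame with the schema
  // (column: String, value: String, count: Long).
  def freqDistAll(df: DataFrame): DataFrame =
    df.columns
      .map { name =>
        // Count occurrences of each distinct value in this column.
        // The cast to string is an assumption made so that all
        // per-column results share a schema and can be unioned.
        df.groupBy(col(name).cast("string").as("value"))
          .count()
          .select(lit(name).as("column"), col("value"), col("count"))
      }
      .reduce(_ union _) // throws on a DataFrame with zero columns

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("freq-dist-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data taken from the question.
    val df = Seq("bob", "jack", "bob").toDF("firstName")
    freqDistAll(df).show()

    spark.stop()
  }
}
```

For the sample column above, the row for bob carries a count of 2 and the row for jack a count of 1; row order is not guaranteed. A long format was chosen here because the columns generally have different sets of distinct values, so a side-by-side (wide) layout would not line up.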

Gia Duong Duc Minh
jojo
  • StackOverflow is not for other people to write your code for you. Show the code that you have written so far, give an exact definition of the problem with your code, and then ask a concrete question with a limited scope. – Madoc Jul 04 '16 at 22:45
  • Possible duplicate of [How to get Histogram of all columns in a large CSV / RDD\[Array\[double\]\] using Apache Spark Scala?](http://stackoverflow.com/questions/33251427/how-to-get-histogram-of-all-columns-in-a-large-csv-rddarraydouble-using-ap) – marios Jul 05 '16 at 00:36

0 Answers