Given a column like so:
+---------+
|firstName|
+---------+
|      bob|
|     jack|
|      bob|
+---------+
I want to output the frequency distribution in the following format: bob -> 2. I'm trying to create a UDF for this with Scala/Spark DataFrames. The end goal is that, given a DataFrame, I can output the frequency distribution of each column in it.
def freqDist(col: Column): Column = {
//need help with this part
}
// iterate through each column and compute its freqDist like so
val freqDists = df.dtypes.map(ct => freqDist(col(ct._1))).toList
New Approach:
val freqDist = df.dtypes.map(ct => {df.select(ct._1).groupBy(col(ct._1)).count()})
This gives me the frequency distribution of each column, but the type of freqDist is now Array[DataFrame]. How can I combine all of these DataFrames into one?
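One possible way to combine them, as a rough sketch rather than a definitive answer: give every per-column result the same schema and then union the pieces. The code below assumes Spark 2.x or later (where `union` exists; on 1.x it would be `unionAll`), casts every value to a string so that columns of different types can share one `value` column, and assumes the DataFrame has at least one column (otherwise `reduce` throws). The names `FreqDistSketch`, `column`, and `value` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object FreqDistSketch {
  // Builds one frequency table per column, reshapes each to the common
  // schema (column, value, count), and unions them into a single DataFrame.
  def freqDist(df: DataFrame): DataFrame =
    df.columns
      .map { name =>
        df.groupBy(col(name))
          .count()
          .select(
            lit(name).as("column"),               // which source column the row describes
            col(name).cast("string").as("value"), // cast so all columns share one type
            col("count")
          )
      }
      .reduce(_ union _) // assumes df has at least one column

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("freqDist")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("bob", "jack", "bob").toDF("firstName")
    freqDist(df).show()
    spark.stop()
  }
}
```

For the sample column above, the result should contain the rows (firstName, bob, 2) and (firstName, jack, 1); row order is not guaranteed. Keeping the source column name as its own field is what lets the per-column tables be stacked without losing track of where each value came from.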