This is a little different from the common word count program: I am trying to get the distinct word count per line.

Input:

Line number one has six words
Line number two has two words

Expected output:

line1 => (Line,1),(number,1),(one,1),(has,1),(six,1),(words,1)
line2 => (Line,1),(number,1),(two,2),(has,1),(words,1)

Can anyone please guide me?


2 Answers


You can do this with the DataFrame built-in functions explode, split, collect_set, and groupBy.

import org.apache.spark.sql.functions._ //for explode, split, struct, collect_set, etc.
import spark.implicits._ //for $"..." and toDF (already in scope in spark-shell)

//input data
val df = Seq("Line number one has six words","Line number two has has two words").toDF("input")

scala> :paste
// Entering paste mode (ctrl-D to finish)

df.withColumn("words",explode(split($"input","\\s+"))) //split on whitespace and explode one row per word
.groupBy("input","words") //group by on both columns
.count()
.withColumn("line_word_count",struct($"words",$"count")) //create struct
.groupBy("input") //grouping by input data column
.agg(collect_set("line_word_count").alias("line_word_count"))
.show(false)

Result:

+---------------------------------+------------------------------------------------------------------+
|input                            |line_word_count                                                   |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words    |[[one, 1], [has, 1], [six, 1], [number, 1], [words, 1], [Line, 1]]|
|Line number two has has two words|[[has, 2], [two, 2], [words, 1], [number, 1], [Line, 1]]          |
+---------------------------------+------------------------------------------------------------------+
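
If you would rather have a single map column instead of an array of structs, and you are on Spark 2.4+, map_from_entries can fold the (word, count) structs into a map. A sketch along the same lines (word_counts is just an illustrative column name):

df.withColumn("words",explode(split($"input","\\s+")))
.groupBy("input","words")
.count()
.groupBy("input")
.agg(map_from_entries(collect_list(struct($"words",$"count"))).alias("word_counts")) //fold (word, count) structs into a map
.show(false)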

If you also want line numbers, use the concat and monotonically_increasing_id functions:

df.withColumn("line",concat(lit("line"),monotonically_increasing_id()+1))
.withColumn("words",explode(split($"input","\\s+"))) 
.groupBy("input","words","line") 
.count() 
.withColumn("line_word_count",struct($"words",$"count")) 
.groupBy("line") 
.agg(collect_set("line_word_count").alias("line_word_count")) 
.show(false)

Result:

+-----+------------------------------------------------------------------+
|line |line_word_count                                                   |
+-----+------------------------------------------------------------------+
|line1|[[one, 1], [has, 1], [six, 1], [words, 1], [number, 1], [Line, 1]]|
|line2|[[has, 2], [two, 2], [number, 1], [words, 1], [Line, 1]]          |
+-----+------------------------------------------------------------------+

Note: monotonically_increasing_id only guarantees increasing, unique IDs, not consecutive ones. On larger (multi-partition) datasets you would need to .repartition(1) first to make the IDs consecutive, at the cost of parallelism.
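
If consecutive line numbers matter and repartitioning to a single partition is too costly, one common alternative is RDD zipWithIndex, which assigns consecutive indices across partitions. A sketch, assuming the same df as above (line is an illustrative column name):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val withLine = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ (idx + 1)) }, //append a 1-based consecutive index to each row
  StructType(df.schema.fields :+ StructField("line", LongType))
)

From there the same explode/groupBy aggregation applies, grouping on the new line column.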


Here is another way, using the RDD API:

val rdd = df.withColumn("output", split($"input", " ")).rdd.map(l => (
                l.getAs[String](0), 
                l.getAs[Seq[String]](1).groupBy(identity).mapValues(_.size).map(identity)) //.map(identity) materializes the lazy mapValues view so it is serializable
          )

val dfCount = spark.createDataFrame(rdd).toDF("input", "output")
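
For reference, dfCount can then be inspected the usual way; the output column comes back as a map, in the same shape as the UDF result shown below:

dfCount.show(false)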

I'm not a big fan of UDFs, but it can also be done like this:

import org.apache.spark.sql.functions.udf

val mapCount: Seq[String] => Map[String, Int] = _.groupBy(identity).mapValues(_.size) //Map[String, Int]: _.size yields Int, so Map[String, Integer] would not compile
val countWordsUdf = udf(mapCount)

df.withColumn("output", countWordsUdf(split($"input", " "))).show(false)

Gives:

+---------------------------------+------------------------------------------------------------------+
|input                            |output                                                            |
+---------------------------------+------------------------------------------------------------------+
|Line number one has six words    |[number -> 1, Line -> 1, has -> 1, six -> 1, words -> 1, one -> 1]|
|Line number two has has two words|[number -> 1, two -> 2, Line -> 1, has -> 2, words -> 1]          |
+---------------------------------+------------------------------------------------------------------+
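
A typed Dataset version of the same idea also works without a UDF. A sketch, assuming spark.implicits._ is in scope and that df has the single string column input:

val dsCount = df.as[String].map { line =>
  (line, line.split("\\s+").groupBy(identity).mapValues(_.size).map(identity)) //.map(identity) materializes the lazy mapValues view so it can be encoded
}.toDF("input", "output")

dsCount.show(false)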