Given a string like "The apple fell from a tree", how do I split it so that each word has the full line of text appended to it, giving me an RDD of strings that looks like this:

"The | The apple fell from a tree"
"apple | The apple fell from a tree"
"fell | The apple fell from a tree"
"from | The apple fell from a tree"
"a | The apple fell from a tree"
"tree | The apple fell from a tree"

This would allow me to keep track of where the word came from.

Here is what I wrote (the relevant parts):

var inputPath = "/path/to/file.txt" // Some txt file
var input = sc.textFile(inputPath) // RDD of lines of text
var words = input.flatMap(line => line.split(" ").foreach(word => word.concat(" | " + line)))

This code example does not work. From what I understand, you cannot traverse within a flatMap more than once? I believe I got an error saying: Found: Unit, Required: TraversableOnce[?].

I am new to Spark, Scala and functional programming, and this is my first time writing Scala. I am not too worried about performance or about writing the shortest possible code. I just want something that works without having to redesign my implementation. I can always refactor later.

I understand that textFile() is giving me an RDD with Strings that represent each line of the text. The flatMap is splitting up those lines by " ", and since it's a flatMap we get one array, as opposed to a bunch of arrays. Please correct me if I am wrong, or not speaking correctly.

teaguecole
    Try changing `foreach` to `map`. `foreach` will not give you any return value, only `Unit`. – Shaido Apr 26 '19 at 03:53

1 Answer

I don't have Spark at hand, so I cannot confirm this right now, but judging by the code and the error message, the problem is most likely just the foreach.

So a quick fix would (likely) be to replace the last line with

input.flatMap(line => line.split(" ").map(word => word.concat(" | " + line)))

Explanation:

  • line.split gives you an Array[String], which is not itself a Traversable[String] but is implicitly converted to one wherever one is required
  • foreach applies a function to each item but returns Unit, which means the call produces no useful return value (in practice it's the singleton instance of the Unit type, but if it helps you can think of it as void in Java terms)
  • map also applies a function to each item, but returns a new collection (potentially of a different concrete type) that contains the transformed items
  • Finally, flatMap is a method that essentially combines map and flatten, i.e. it takes a function that takes an item and returns a Traversable[OtherType], applies the function to each item, and then "flattens" the resulting Traversable[Traversable[OtherType]] by concatenating the inner traversables. So you need to give it a String => Traversable[String] (Spark's RDD.flatMap asks for a TraversableOnce, hence the error message), but you're passing a String => Unit
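
To make the difference between the three methods concrete, here is a minimal sketch in plain Scala (no Spark needed). The object name ForeachVsMap and the helper tagWords are made up for illustration and are not part of the question's code.

```scala
object ForeachVsMap {
  // Hypothetical helper: pairs each word of a line with the full line
  def tagWords(line: String): Array[String] =
    line.split(" ").map(word => s"$word | $line")

  def main(args: Array[String]): Unit = {
    val lines = List("The apple fell", "from a tree")

    // foreach only runs the function for its side effects; the result is Unit
    val nothing: Unit = lines.head.split(" ").foreach(word => word.concat(" | " + lines.head))

    // map keeps the results: one array of tagged words per line
    val nested: List[Array[String]] = lines.map(l => tagWords(l))

    // flatMap does the same and then concatenates the inner arrays
    val flat: List[String] = lines.flatMap(l => tagWords(l))

    println(nested.length) // 2, one array per line
    println(flat.length)   // 6, one string per word
    flat.foreach(println)
  }
}
```

The same lambda that works with List.flatMap here should also be accepted by RDD.flatMap, since an Array is implicitly converted to the collection type that flatMap asks for.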

Refer to the Scala Traversable docs for more info.

Similar code on a plain list of strings:

scala> List(
    "line1 word1 word2", 
    "line2 word3 word4"
)
    .flatMap(line => line.split(" ").map(word => s"$word | $line"))

res5: List[String] = List(
    line1 | line1 word1 word2,
    word1 | line1 word1 word2,
    word2 | line1 word1 word2,
    line2 | line2 word3 word4,
    word3 | line2 word3 word4,
    word4 | line2 word3 word4
)

By the way, Scala encourages immutability, so you might want to use val instead of var unless you really need to reassign the value; val is roughly equivalent to final in Java. In your code example you can safely replace every var with val.
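
As a small illustration of the val/var distinction, here is a sketch in plain Scala. The object ValVsVar, the helper tagWords and the separator value are hypothetical names chosen for this example.

```scala
object ValVsVar {
  // Hypothetical helper written with vals only
  def tagWords(line: String): Array[String] = {
    val separator = " | "              // val: bound once, never reassigned
    // separator = " - "               // would not compile: reassignment to val
    val words = line.split(" ")
    words.map(word => word + separator + line)
  }

  def main(args: Array[String]): Unit = {
    var count = 0                      // var: only when the value must change
    for (tagged <- tagWords("a tree")) {
      println(tagged)
      count += 1
    }
    println(count) // 2
  }
}
```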

J0HN