Given a string like "The apple fell from a tree", how do I split it so that each word has the full line of text appended to it? I want to end up with an RDD of strings that looks like this:
"The | The apple fell from a tree"
"apple | The apple fell from a tree"
"fell | The apple fell from a tree"
"from | The apple fell from a tree"
"a | The apple fell from a tree"
"tree | The apple fell from a tree"
This would allow me to keep track of where the word came from.
Here is what I wrote (the relevant parts):
var inputPath = "/path/to/file.txt" // Some txt file
var input = sc.textFile(inputPath) // RDD of lines of text
var words = input.flatMap(line => line.split(" ").foreach(word => word.concat(" | " + line)))
This code does not compile. From what I understand, the reason is that you cannot traverse more than once within a flatMap? I believe the error I got said: found: Unit, required: TraversableOnce[?]
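A likely reading of that error: foreach runs its function only for side effects and returns Unit, and String.concat returns a new string rather than changing word in place, so the function passed to flatMap ends up returning Unit instead of a collection. Swapping foreach for map returns the array of tagged words, which is the kind of value flatMap expects. Below is a minimal sketch of that idea. It uses a plain Scala Seq as a stand-in for the RDD so it runs without a Spark dependency, and tagWords and TagWordsSketch are hypothetical names used only for illustration.

```scala
object TagWordsSketch {
  // Hypothetical helper: pairs every word of a line with the full line it came from.
  // map returns a new Array[String]; foreach would return Unit instead.
  def tagWords(line: String): Array[String] =
    line.split(" ").map(word => word + " | " + line)

  def main(args: Array[String]): Unit = {
    // Stand-in for sc.textFile(inputPath): one element per line of text.
    val lines = Seq("The apple fell from a tree")

    // flatMap flattens the per-line arrays into a single collection of strings.
    val words = lines.flatMap(line => tagWords(line))

    // Prints one "word | line" string per word.
    words.foreach(println)
  }
}
```

Assuming input is the RDD from the snippet above, the same function body should carry over unchanged: input.flatMap(line => line.split(" ").map(word => word + " | " + line)).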
I am new to Spark, Scala, and functional programming. This is my first time writing Scala, so I am not too worried about performance or about writing the shortest possible code. I just want something that works without having to redesign my implementation. I can always refactor later.
I understand that textFile() gives me an RDD of Strings, one for each line of the text. The flatMap splits those lines on " ", and since it's a flatMap we get one array, as opposed to a bunch of arrays. Please correct me if I am wrong or using the wrong terminology.
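To make the map versus flatMap distinction concrete, here is a small sketch, again using a plain Scala Seq as a stand-in for the RDD (an assumption made so the example runs without Spark; MapVsFlatMapSketch is a hypothetical name). With map, each line yields its own array of words, so the result is a collection of arrays. With flatMap, those arrays are flattened into one collection of words. On a real RDD the flattened result would be an RDD[String] rather than an array.

```scala
object MapVsFlatMapSketch {
  // Two lines of text, standing in for the RDD returned by textFile().
  val lines: Seq[String] = Seq("The apple", "fell from a tree")

  // map keeps one array per input line: a collection of arrays.
  val nested: Seq[Array[String]] = lines.map(line => line.split(" "))

  // flatMap flattens the per-line arrays into one collection of words.
  val flat: Seq[String] = lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    println(nested.map(_.length)) // prints List(2, 4)
    println(flat)                 // prints List(The, apple, fell, from, a, tree)
  }
}
```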