I have a Spark cluster with 1 master and 3 workers. I have a simple but gigantic CSV file like this:
FirstName, Age
Waldo, 5
Emily, 7
John, 4
Waldo, 9
Amy, 2
Kevin, 4
...
I want to get all the records where FirstName is "Waldo". Normally, on one machine in local mode, if I have the data in a local collection, I can call ".parallelize()" on it to get an RDD; then, assuming the resulting variable is "mydata", I can do:
mydata.foreach(x => if (x.split(",")(0).trim == "Waldo") println(x)) // print each record whose first column is "Waldo"
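For reference, here is a self-contained version of that local-mode approach (a minimal sketch; the object name and the sample data are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object LocalWaldo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LocalWaldo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Pretend the CSV rows are already in a local collection.
    val rows = Seq("Waldo, 5", "Emily, 7", "John, 4", "Waldo, 9")
    val mydata = sc.parallelize(rows)

    // Print every record whose first column is "Waldo".
    mydata.foreach(x => if (x.split(",")(0).trim == "Waldo") println(x))

    sc.stop()
  }
}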
From my understanding, using the above method, every Spark worker would have to iterate over the entire gigantic CSV to produce the result, instead of each machine processing a third of the file (correct me if I am wrong).
Problem is, if I have 3 machines, how do I make it so that:
- The CSV file is broken up into 3 different "sets", one for each worker, so that each worker works with a much smaller piece of the data (1/3rd of the original)
- Each worker processes its piece and finds all the records where FirstName is "Waldo"
- The resulting list of "Waldo" records is reported back to me in a way that actually takes advantage of the cluster (a sketch of what I imagine this looks like follows below).
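To make the question concrete, here is a minimal sketch of what I imagine the distributed version should look like. The hdfs:///data/names.csv path is a placeholder, and I'm assuming sc.textFile is what handles the splitting:

import org.apache.spark.{SparkConf, SparkContext}

object FindWaldo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FindWaldo") // master URL supplied via spark-submit
    val sc = new SparkContext(conf)

    // textFile splits the file into partitions that are distributed
    // across the workers, so no manual splitting should be needed.
    val lines = sc.textFile("hdfs:///data/names.csv")

    // Drop the "FirstName, Age" header row.
    val header = lines.first()
    val records = lines.filter(_ != header)

    // Each worker filters only the partitions it holds.
    val waldos = records.filter(line => line.split(",")(0).trim == "Waldo")

    // Only the matching records travel back to the driver.
    waldos.collect().foreach(println)

    sc.stop()
  }
}

Is this the right way to think about it, or do I have to partition the file myself before handing it to the workers?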