
I have a Spark cluster with 1 master and 3 workers, and a simple but gigantic CSV file like this:

FirstName, Age
Waldo, 5
Emily, 7
John, 4
Waldo, 9
Amy, 2
Kevin, 4
...

I want to get all the records where FirstName is "Waldo". Normally, on one machine in local mode, I can call `.parallelize()` on my data to get an RDD, and then, assuming the variable is `mydata`, I can do:

mydata.foreach(x => if (x.split(',')(0).trim == "Waldo") println(x))  // print the rows whose first column is "Waldo"
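
For concreteness, here is a minimal, self-contained local-mode sketch of what I mean (the sample data is inlined purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs everything on a single machine, using all of its cores
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("find-waldo"))

val lines  = Seq("Waldo, 5", "Emily, 7", "John, 4", "Waldo, 9", "Amy, 2", "Kevin, 4")
val mydata = sc.parallelize(lines)

mydata.foreach(x => if (x.split(',')(0).trim == "Waldo") println(x))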

From my understanding, using the above method, every Spark worker would have to perform that iteration over the entire gigantic CSV to get the result, instead of each machine processing a third of the file (correct me if I am wrong).

The problem is, if I have 3 machines, how do I make it so that:

  1. The CSV file is broken up into 3 different "sets", one per worker, so that each worker works with a much smaller slice (1/3rd of the original)
  2. Each worker processes its slice and finds all the records where FirstName is "Waldo"
  3. The resulting list of "Waldo" records is reported back to me in a way that actually takes advantage of the cluster.
Rolando
    Possible duplicate of [How to find a specific record in spark in cluster mode using scala?](http://stackoverflow.com/questions/33042451/how-to-find-a-specific-record-in-spark-in-cluster-mode-using-scala) – zero323 Oct 09 '15 at 16:24
  • You're wrong. `foreach` on an RDD can access only its own part of the dataset. While Spark is not the best tool here, it can be done: see http://stackoverflow.com/a/31544650/1560062 Finally, please don't post duplicates. After a few years here you should know better. – zero323 Oct 09 '15 at 16:31
  • @zero323, what is a better tool than Spark for this purpose? I know there are databases, but I don't want to duplicate the data into a database/index it to get to this result. – Rolando Oct 09 '15 at 17:49
  • An index is not a hard requirement for efficient search. Techniques like sort dimension, hashing or partitioning can have similar performance. Spark is a batch processing tool, so it is simply not designed for single-record access. – zero323 Oct 09 '15 at 21:39
  • I think of Spark as useful because you can split a big dataset into a separate partition for each machine to handle. I can't think of any other big data solution that is good for efficient search (without having to "load into a separate database or datastore"). Do you know of any good alternative to Spark for this? – Rolando Oct 09 '15 at 21:52
  • Any modern database can do it; using Spark here is simply a brute-force approach. You can take a look at https://github.com/amplab/spark-indexedrdd – zero323 Oct 09 '15 at 22:23
  • The problem is that my files are raw files in HDFS, and I do not want to load the data into something else, which is why I chose Spark. I don't know if there is anything better. – Rolando Oct 10 '15 at 05:19
  • You could use bloom filters, but honestly, if you want to use raw data, stored in an inefficient format, without indexing or using external storage, then you pay a price :) – zero323 Oct 10 '15 at 09:14

1 Answer


Mmm, lots of points to make here. First, if you are using HDFS, your file is already partitioned; it is a distributed file system. You probably even have the data replicated 3 times, as that is the default (it depends on your config, though).
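
If you want to double-check that, here is a rough sketch (the path is a placeholder) using the Hadoop FileSystem API from the Spark shell to look at the replication factor and block layout of the file:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs     = FileSystem.get(sc.hadoopConfiguration)          // sc is the shell's SparkContext
val status = fs.getFileStatus(new Path("my_hdfs_path"))
println(s"replication: ${status.getReplication}, block size: ${status.getBlockSize}")
// each block of the file lives on one or more worker hosts:
fs.getFileBlockLocations(status, 0, status.getLen).foreach(b => println(b.getHosts.mkString(", ")))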

Second, Spark will indeed make use of this partitioning when you tell it to load the data, and will process chunks locally. Shuffling data around is only required when you want to, for instance, re-partition your data by some criteria, like keys in a key/value pair.
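
As a rough illustration of the difference (the path is a placeholder): a plain filter is a narrow transformation that each worker applies to its own partitions, while a keyed aggregation such as reduceByKey forces a shuffle:

val lines = sc.textFile("my_hdfs_path")                      // roughly one partition per HDFS block
println(lines.partitions.length)                             // how many chunks Spark sees

val waldos = lines.filter(_.split(',')(0).trim == "Waldo")   // narrow: no shuffle, each worker filters its own partitions

val counts = lines                                           // wide: grouping by key shuffles data across workers
  .map(l => (l.split(',')(0).trim, 1))
  .reduceByKey(_ + _)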

Third, Spark is indeed great for doing batch processing and some data mining if you don't want to structure a database or don't have predefined access patterns. In short, it fits what you seem to need. You don't even need to write and compile code, since you can run the Spark shell and try it with a few lines. I do recommend you look at the docs, since you don't seem to have a clear grasp of the platform yet.
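
For example, assuming a standalone cluster (the master host below is just a placeholder), you can start a shell that runs against the cluster rather than locally:

spark-shell --master spark://<master-host>:7077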

Fourth, I don't have an IDE or anything here, but the code you need should be something like this (sort of pseudocode, but it should be VERY close):

sc
  .textFile("my_hdfs_path")              // one partition per HDFS block, read where the data lives
  .keyBy(_.split(',')(0).trim)           // key each line by its first column (your sample is comma-separated)
  .filter(_._1 == "Waldo")               // each worker keeps only its own "Waldo" records
  .map(_._2)                             // drop the key, keep the original line
  .saveAsTextFile("my_hdfs_out")         // write the results back to HDFS, in parallel

If the result is not too big, you can also use collect to bring all the results back to the driver instead of saving to a file, but after that you are back on a single machine; a small sketch of that variant follows. Hope it helps!
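
A sketch of that collect variant, assuming the filtered result is small enough to fit on the driver:

val waldos = sc
  .textFile("my_hdfs_path")
  .filter(_.split(',')(0).trim == "Waldo")   // same condition, without building key/value pairs

waldos.collect().foreach(println)            // brings every matching record back to the driver
// or, if you are unsure about the size, just peek at a bounded sample:
waldos.take(20).foreach(println)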

Daniel Langdon
  • Code needs to be written since spark-shell does not take advantage of a "clustered spark setup", right? The data could potentially be huge... I don't want to collect. Do you think this is an appropriate use of Spark? – Rolando Oct 12 '15 at 03:49
  • It does take advantage of it. And it is an appropriate use. You can also pack it in a jar to run it more than once. – Daniel Langdon Oct 12 '15 at 17:30