
I have 200,000 points in a 1000-dimensional space.

If I load all these points using sc.textFile and exhaustively calculate the distance between every pair of points, how can I do it in a parallel manner? Will Spark automatically parallelize the work for me?

Rodrigo Stv

1 Answer


Yes, Spark automatically parallelizes the work if you use it properly. Here's the Spark introduction guide to get started.

As for your use case, do you really need the distance between all points? Computing 200,000² = 40 billion distances will be quite expensive. If you really want to do this, you'd likely want to use the cartesian function, which returns an RDD of all pairs of the input data (i.e., 40 billion pairs). Then you can calculate the distance for each pair with a map function, as sketched below.
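A minimal PySpark sketch of that approach, assuming the points are stored one per line as comma-separated floats in a hypothetical file points.txt:

    import math

    from pyspark import SparkContext

    sc = SparkContext(appName="PairwiseDistances")

    # Parse each line into a tuple of floats (one 1000-dimensional point per line).
    points = sc.textFile("points.txt").map(
        lambda line: tuple(float(x) for x in line.split(","))
    )

    def euclidean(p, q):
        # Euclidean distance between two equal-length point tuples.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # cartesian() yields every ordered pair (200k x 200k = 40 billion pairs);
    # map() then computes one distance per pair, distributed across partitions.
    distances = points.cartesian(points).map(lambda pq: euclidean(pq[0], pq[1]))

If you only need each unordered pair once, you can key the points with zipWithIndex() and filter the cartesian result to pairs whose first index is smaller than the second, which roughly halves the work.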

David
  • No, this is not a real problem -- I am just thinking of some problem that could be parallelized. – Rodrigo Stv Apr 18 '16 at 20:51
  • To parallelize work, distributed frameworks assume that each record (or row) can be processed independently of all other records. If that's not true, you'll have to reshape the data until it is. In your example, consider the starting point of having a dataframe of 200k points. You can't find a distance with just one record (or one point), so you would need to explode this dataframe into 40B rows that contain all possible pairs of points. Then you'd be able to process each row independently of all others (see the sketch after these comments). – David Apr 18 '16 at 21:07
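Here is a minimal sketch of the cross-join approach described in the comment above, using the DataFrame API (crossJoin is available in Spark 2.1+; the column names and toy data are illustrative):

    import math

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("AllPairs").getOrCreate()

    # A toy stand-in for the 200k-point dataframe: one id and one vector per row.
    df = spark.createDataFrame(
        [(0, [0.0, 0.0]), (1, [3.0, 4.0]), (2, [6.0, 8.0])],
        ["id", "vec"],
    )

    # Cross-join the dataframe with itself: n rows become n^2 rows, each
    # holding both points of a pair, so every row is processable independently.
    pairs = df.crossJoin(
        df.withColumnRenamed("id", "id2").withColumnRenamed("vec", "vec2")
    )

    # Compute the distance per row with a UDF, one pair per row.
    dist = udf(lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
               DoubleType())
    pairs.withColumn("dist", dist("vec", "vec2")).show()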