Given the following dataset:
movieID: abgh
movieName: Titanic
reviewer: John Smith
score: 3.5
movieID: adsa
movieName: Jumanji
reviewer: Mary Jo
score: 4.5
...(assume the data is in a single text file where there are always 4 rows representing an entry)
Given this small text file, I'm trying to use Spark to do some analysis on the dataset and get the average score per movieID. My lecturer suggested the following:
1. Read the text file as an RDD.
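i.e. something like this (reviews.txt is just a placeholder for the actual path, and sc is the SparkContext from the shell):

val RDD1 = sc.textFile("reviews.txt")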
2. Create two RDDs, one for movieID and one for score, using filter, i.e.
val movieID = RDD1.filter(z => z.contains("movieID")).map(_.split(":")).map(z => z(1).trim)
val score = RDD1.filter(z => z.contains("score")).map(_.split(":")).map(z => z(1).trim.toFloat)
3. Zip the two RDDs from (2) together, so that each row becomes a (movieID, score) pair.
val zip_rdd = movieID.zip(score)
val mean_score = zip_rdd
  .mapValues(value => (value, 1))
  .reduceByKey { case ((sumL, countL), (sumR, countR)) => (sumL + sumR, countL + countR) }
  .mapValues { case (sum, count) => sum / count }
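On the sandbox I can sanity-check the pairs and the final averages with something like:

zip_rdd.take(5).foreach(println)
mean_score.collect().foreach(println)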
I was wondering: since data is partitioned in Spark, can we guarantee that the lines are read in sequence, i.e. that the movieID and score in each pair come from the same review?
Thanks in advance for any help!
EDIT: In case it wasn't clear, can I be sure that the key/value pairs in zip_rdd come from the same review? I'm using a pseudo-cluster now (Hortonworks sandbox), but I'm wondering whether anything changes if the data size is scaled up dramatically and I end up using a real cluster to compute it.
From a Spark newbie.