
Given the following dataset:

movieID: abgh
movieName: Titanic
reviewer: John Smith
score: 3.5

movieID: adsa
movieName: Jumanji
reviewer: Mary Jo
score: 4.5

...(assume the data is in a single text file where every entry is always represented by 4 rows)

Given this small text file, we're trying to use Spark to do some analysis on the dataset and get the average score per movieID. My lecturer suggested the following:

  1. read the text file in as an RDD

  2. create 2 RDDs, one of movieIDs and one of scores, using filter, i.e.
    val movieID = RDD1.filter(z => z.contains("movieID")).map(_.split(":")).map(z => z(1).trim)
    val score = RDD1.filter(z => z.contains("score")).map(_.split(":")).map(z => z(1).trim.toFloat)

  3. from (2), zip the 2 RDDs together, which should give me one (movieID, score) pair per review.
    val zip_rdd = movieID.zip(score)
    val mean_score = zip_rdd
      .mapValues(value => (value, 1))
      .reduceByKey { case ((sumL, countL), (sumR, countR)) => (sumL + sumR, countL + countR) }
      .mapValues { case (sum, count) => sum / count }

I was wondering: since data is partitioned in Spark, can we guarantee that the data is read in sequence, i.e. that each movieID and score come from the same review? For reference, a minimal end-to-end version of what I'm running is below. Thanks for the help in advance!
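
For completeness, a minimal end-to-end sketch of the job as I understand it, assuming the input file is named reviews.txt (the file name is just a placeholder):

    // Read the raw text file; each review spans 4 consecutive lines.
    val RDD1 = sc.textFile("reviews.txt")

    // Keep only the movieID / score lines and extract the value after the ":".
    val movieID = RDD1.filter(_.contains("movieID")).map(_.split(":")(1).trim)
    val score = RDD1.filter(_.contains("score")).map(_.split(":")(1).trim.toFloat)

    // Pair the two RDDs positionally, then average the scores per movie.
    val zip_rdd = movieID.zip(score)
    val mean_score = zip_rdd
      .mapValues(v => (v, 1))
      .reduceByKey { case ((sumL, countL), (sumR, countR)) => (sumL + sumR, countL + countR) }
      .mapValues { case (sum, count) => sum / count }

    mean_score.collect().foreach(println)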

EDIT: in case it wasn't clear, can I be sure that the key/value pairs in zip_rdd come from the same review? I'm using a pseudo-cluster right now (the Hortonworks sandbox), but I'm wondering if anything will change when the data size is scaled up dramatically and I end up using a real cluster to compute it.

from a Spark newbie.

XJL

1 Answer


It's fine: reading in from disk preserves line order, filter is a narrow transformation, and zip relies on exactly that fact. There are no wide transformations before the zip, so the relative order of the two RDDs is maintained.

Alternatively, you can zipWithIndex both RDDs and then JOIN on the resulting index in an appropriate manner. The indexing is a narrow, order-preserving operation, so no issue there; a sketch follows below.
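
A minimal sketch of that alternative, reusing the movieID and score RDDs from the question (the new variable names are placeholders, not from the original post):

    // Attach to each element its position in the RDD as a Long index.
    val movieIDIdx = movieID.zipWithIndex().map { case (id, idx) => (idx, id) }
    val scoreIdx = score.zipWithIndex().map { case (s, idx) => (idx, s) }

    // Join on the index, so each movieID is paired with the score
    // that sits at the same position in the file.
    val pairs = movieIDIdx.join(scoreIdx).values // RDD[(String, Float)]

The join does shuffle, but because each index is unique the pairing is deterministic regardless of how the data ends up partitioned.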

thebluephantom
  • Thanks for replying. But what if we had stored the file in hdfs? – XJL Sep 24 '20 at 08:23
  • That does not matter. – thebluephantom Sep 24 '20 at 08:23
  • Not sure if I understood HDFS correctly, but had the file been super big, i.e. gigabytes in size, will it still ensure order is preserved? – XJL Sep 24 '20 at 08:26
  • well what hope have you got otherwise? – thebluephantom Sep 24 '20 at 08:27
  • I was thinking a more tedious processing step would have been needed to ensure that score and id are synchronised. But I guess the lecturer wasn't wrong, in the limited example. Thanks! I'll accept it as the answer. – XJL Sep 24 '20 at 08:29
  • I see, so this question was similar to this https://stackoverflow.com/q/45822242/4466922 – XJL Sep 24 '20 at 08:32
  • Yes, and as the stackoverflow link writes, beware: if the reading of the source file is dispatched among many threads / computers, one will start with line number 1 but maybe not finish at logical sub-record number 4, while another can receive a part of the source file beginning at logical sub-record number 2, 3 or 4. – Marc Le Bihan Sep 25 '20 at 06:52
  • @MarcLeBihan Not sure what your point is, but there are established guarantees when reading and not yet shuffling, with narrow transformations like filter. If unhappy or not convinced, one can always use zipWithIndex and some smarts with a JOIN. – thebluephantom Sep 25 '20 at 06:59
  • @thebluephantom if you have a `master=local[5]` on your computer, chances are that a 9,842,511-record file will be split into blocks of 1,968,502 and 1,968,503 records, one for each thread. The first thread will have entry #1, but where will the parts for the second, third, fourth and fifth threads start? Maybe at `reviewer: Mary Jo`. – Marc Le Bihan Sep 25 '20 at 07:18
  • @MarcLeBihan splits have nothing to do with local[x]. Splits are splits. – thebluephantom Sep 25 '20 at 07:20
  • @thebluephantom no need to argue, read the end of the stackoverflow link. It explains what trouble can occur. – Marc Le Bihan Sep 25 '20 at 07:30
  • @MarcLeBihan I am pretty well versed in what it all means on boundaries etc., but we are talking about relative position with zipped files. Anyway. – thebluephantom Sep 25 '20 at 07:42