Goal
- Print a dataset of movie names together with the number of times each movie has been rated.
- That's a simple way to find the most "popular" movies.
Data
- One file called "u.data" with userID, movieID, rating, timestamp (see the layout sketch after this list)
- One file called "u.item" with movieID, the movie name, and other information about each movie
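For reference, here is a minimal sketch of how one line of each file is laid out and parsed, assuming the standard MovieLens 100k formats (the sample line is purely illustrative, not taken from my files):

# u.data is tab/whitespace separated: userID  movieID  rating  timestamp
# u.item is pipe separated: movieID | movie title | release date | ...
sample_data_line = "196\t242\t3\t881250949"  # hypothetical u.data line
fields = sample_data_line.split()
movie_id = int(fields[1])                    # movieID is the second field, hence split()[1] below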
Method
- Create a dictionary (key = movieID, value = movie name) from the u.item file
- Broadcast the dictionary to the executor nodes on the cluster
- Create an RDD with (movieID, 1) for each rating line
- Reduce this RDD by movieID, summing the 1s to get a total rating count per movie
- Flip the key (movieID) and the value (total) so the dataset can be sorted by this total (a toy sketch of these last three steps follows)
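To make those last three steps concrete, here is a toy, non-Spark sketch with hypothetical movie IDs showing what the count, reduce, and flip stages compute:

from collections import Counter
ratings = [(50, 1), (172, 1), (50, 1)]                     # hypothetical (movieID, 1) pairs
counts = Counter(movieID for movieID, one in ratings)      # reduce step: {50: 2, 172: 1}
flipped = sorted((total, movieID) for movieID, total in counts.items())
print(flipped)                                             # [(1, 172), (2, 50)] -> ordered by total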
Issue
- Then I map each movieID to its name with the broadcast dictionary, but I get a syntax error on this line:
sortedMoviesWithNames = sortedMovies.map(lambda (count, movie) : (nameDict.value[movie], count))
This code example is from a cookbook for Apache Spark and Python. All the other coding exercises work perfectly in my environment: Windows 10 / Canopy / Python 3.5 / Spark 2.3.2.
I've checked the broadcast dictionary and it's fine, and I've already printed the sortedMovies RDD, which is fine too. I've also checked the book's online errata and found nothing.
I'm wondering if this is a syntax error due to the Python version or something like that.
from pyspark import SparkConf, SparkContext

# Build a dict of movieID -> movie name from u.item (pipe-separated)
def loadMovieNames():
    movieNames = {}
    with open("ml-100k/u.item") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

conf = SparkConf().setMaster("local").setAppName("PopularMovies")
sc = SparkContext(conf = conf)

# Broadcast the movieID -> name dict to the executors
nameDict = sc.broadcast(loadMovieNames())

# (movieID, 1) for every rating line, then sum the 1s per movieID
lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
movies = lines.map(lambda x: (int(x.split()[1]), 1))
movieCounts = movies.reduceByKey(lambda x, y: x + y)

# Flip to (count, movieID) so sortByKey orders by rating count
flipped = movieCounts.map(lambda x: (x[1], x[0]))
sortedMovies = flipped.sortByKey()

# This is the line that raises the syntax error
sortedMoviesWithNames = sortedMovies.map(lambda (count, movie) : (nameDict.value[movie], count))

results = sortedMoviesWithNames.collect()
for result in results:
    print(result)
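From what I can tell, Python 3 removed tuple parameter unpacking in function and lambda signatures (PEP 3113), so a lambda written as lambda (count, movie): ... would be a syntax error on Python 3.5 regardless of Spark. If that is indeed the cause here, a minimal sketch of an equivalent line that indexes into the tuple instead of unpacking it could look like this (my guess at a workaround, not the book's code):

# Index into the (count, movieID) pair instead of unpacking it in the lambda signature
sortedMoviesWithNames = sortedMovies.map(lambda cm: (nameDict.value[cm[1]], cm[0]))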