Goal

  • Print a dataset with Movie Name & Number of times it has been rated.
  • That's a simple way to get the most "popular" movie

Data

  • One file called "u.data" with userID, movieID, rating, timestamp
  • One file called "u.item" with the movieID, the movie name, and other information about each movie

Method

  • Create a dictionary with key = movieID and value = movie name from the u.item file
  • Broadcast the dictionary to the executor nodes on the cluster
  • Create an RDD with (movieID, 1) for each line of u.data
  • Reduce this RDD by movieID, summing the 1s to get a total per movie
  • Flip the key (movieID) and the value (total) so the dataset can be sorted by total
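
As a rough local illustration of these steps, here is a plain-Python sketch with no Spark involved. The sample rows, the movie names, and the variable names (`name_dict`, `rating_rows`, and so on) are made up for the example, and a plain dict stands in for the broadcast variable. It assumes the movie ID is the second whitespace-separated field of each rating row, which is how the code in this post reads it.

```python
from collections import Counter

# Hypothetical stand-ins for u.item and u.data (not real MovieLens rows).
name_dict = {50: "Star Wars (1977)", 258: "Contact (1997)", 100: "Fargo (1996)"}
rating_rows = [
    "1\t50\t5\t881250949",
    "2\t50\t4\t881250950",
    "3\t258\t3\t881250951",
    "4\t50\t5\t881250952",
    "5\t258\t4\t881250953",
    "6\t100\t5\t881250954",
]

# Equivalent of mapping to (movieID, 1) and reducing by key: count rows per movie ID.
movie_counts = Counter(int(row.split()[1]) for row in rating_rows)

# Flip to (total, movieID) and sort by total, like map + sortByKey.
sorted_movies = sorted((total, movie_id) for movie_id, total in movie_counts.items())

# Look up each movie ID in the dictionary, indexing into the pair instead of
# unpacking it in the parameter list.
sorted_with_names = [(name_dict[pair[1]], pair[0]) for pair in sorted_movies]

for result in sorted_with_names:
    print(result)
```

With this sample data the least-rated movie prints first and the most-rated one last, matching the ascending sort.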

Issue

  • Next I should map each movieID to its name using the broadcast dictionary, but I get a syntax error on this line:
    sortedMoviesWithNames = sortedMovies.map(lambda (count, movie) : (nameDict.value[movie], count))

This code example is from a cookbook for Apache Spark and Python. All the other coding exercises work perfectly in my environment (Windows 10 / Canopy / Python 3.5 / Spark 2.3.2). I've checked the broadcast dictionary and it is fine, and I have already printed the sortedMovies RDD, which is fine too. I've also checked the book's online errata and found nothing.

I'm wondering if this is a syntax error due to the Python version or something like that.

from pyspark import SparkConf, SparkContext

def loadMovieNames():
    movieNames = {}
    with open("ml-100k/u.item") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

conf = SparkConf().setMaster("local").setAppName("PopularMovies")
sc = SparkContext(conf = conf)

nameDict = sc.broadcast(loadMovieNames())

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
movies = lines.map(lambda x: (int(x.split()[1]), 1))
movieCounts = movies.reduceByKey(lambda x, y: x + y)

flipped = movieCounts.map(lambda x: (x[1], x[0]))
sortedMovies = flipped.sortByKey()

sortedMoviesWithNames = sortedMovies.map(lambda (count, movie) : 
(nameDict.value[movie], count))

results = sortedMoviesWithNames.collect()

for result in results:
    print(result)

  • This might be a case of "tuple parameter unpacking", which was removed in Python 3 (see [Nested arguments not compiling](https://stackoverflow.com/questions/10607293/nested-arguments-not-compiling)). – user2314737 Dec 27 '18 at 17:17
  • That's it. Thanks. I posted the correct syntax in my comment on the answer below. – Slimpunkerz Dec 27 '18 at 23:46
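
One way to check whether the interpreter itself rejects the lambda, independent of Spark, is to try compiling the expression from a string. This is only a small sketch; the helper name `compiles` and the two sample strings are chosen here for illustration.

```python
# Python 2 style lambda with a tuple parameter, kept as a string so that
# this script itself still parses on Python 3.
py2_style = "lambda (count, movie): (movie, count)"

# Python 3 compatible form: take one argument and index into it.
py3_style = "lambda pair: (pair[1], pair[0])"


def compiles(source):
    """Return True if `source` is a valid expression for the running interpreter."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(compiles(py2_style))  # False on Python 3
print(compiles(py3_style))  # True
```

If the first call reports False, the error comes from the Python version rather than from Spark or the data.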

1 Answer

I believe the correct syntax for a lambda with multiple arguments is:

sum_function = lambda a, b: a + b

Note the missing parentheses. If you are trying to map a tuple to another tuple, you will need to do something like:

lambda tup: (nameDict.value[tup[1]], tup[0])

Python functions do not automatically unpack tuples, so a multi-argument function will not work if you pass it a single tuple in place of its arguments (unpacking a tuple at the call site is what the * operator is for).
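
To make the options concrete, here is a small sketch in plain Python, without Spark. The dict `names` is a hypothetical stand-in for `nameDict.value`, and the sample pairs and function names are invented for the example. It shows indexing into the tuple, unpacking inside a named function, and unpacking at the call site with `*` or `itertools.starmap`.

```python
from itertools import starmap

# Hypothetical stand-in for nameDict.value and for the (count, movieID) pairs.
names = {50: "Star Wars (1977)", 100: "Fargo (1996)"}
sorted_movies = [(1, 100), (3, 50)]

# Option 1: a one-argument lambda that indexes into the tuple.
by_index = list(map(lambda tup: (names[tup[1]], tup[0]), sorted_movies))


# Option 2: a named one-argument function that unpacks inside its body.
def with_name(pair):
    count, movie = pair
    return (names[movie], count)


by_unpacking = list(map(with_name, sorted_movies))


# Option 3: a two-argument function, with the tuple unpacked by the caller.
def swap_with_name(count, movie):
    return (names[movie], count)


by_star = [swap_with_name(*pair) for pair in sorted_movies]
by_starmap = list(starmap(swap_with_name, sorted_movies))

print(by_index)
```

All four variants produce the same list. The first two take a single argument, so they match what a one-argument `map` callback receives; the third needs the caller to do the unpacking.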

John Adams
  • Thanks for this quick and good answer. `sortedMoviesWithNames = sortedMovies.map(lambda moviesDim : (nameDict.value[moviesDim[1]], moviesDim[0]))` works well! – Slimpunkerz Dec 27 '18 at 13:45