Goal
- Print a dataset of movie names together with the number of times each movie has been rated.
- That's a simple way to find the most "popular" movies.
Data
- One file called "u.data" with userID, movieID, rating, timestamp (see the layout sketch after this list)
- One file called "u.item" with movieID, the movie name, and other information about each movie
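For reference, here is a minimal sketch of how one line of each file is laid out and parsed, assuming the standard MovieLens 100k formats (the sample line is purely illustrative, not taken from my files):

# u.data is tab/whitespace separated: userID  movieID  rating  timestamp
# u.item is pipe separated: movieID | movie title | release date | ...
sample_data_line = "196\t242\t3\t881250949"  # hypothetical u.data line
fields = sample_data_line.split()
movie_id = int(fields[1])                    # movieID is the second field, hence split()[1] below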
Method
- Create a dictionary (key = movieID, value = movie name) from the u.item file
- Broadcast the dictionary to the executor nodes on the cluster
- Create an RDD with (movieID, 1) for each rating line
- Reduce this RDD by movieID, summing the 1s to get a total rating count per movie
- Flip the key (movieID) and the value (total) so the dataset can be sorted by this total (a toy sketch of these last three steps follows)
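To make those last three steps concrete, here is a toy, non-Spark sketch with hypothetical movie IDs showing what the count, reduce, and flip stages compute:

from collections import Counter
ratings = [(50, 1), (172, 1), (50, 1)]                     # hypothetical (movieID, 1) pairs
counts = Counter(movieID for movieID, one in ratings)      # reduce step: {50: 2, 172: 1}
flipped = sorted((total, movieID) for movieID, total in counts.items())
print(flipped)                                             # [(1, 172), (2, 50)] -> ordered by total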
Issue
- Then I map each movieID to its name with the broadcast dictionary, but I get a syntax error on this line:
sortedMoviesWithNames = sortedMovies.map(lambda (count, movie) : (nameDict.value[movie], count))
This code example is from a cookbook for Apache Spark and Python. All the other coding exercises work perfectly in my environment: Windows 10 / Canopy / Python 3.5 / Spark 2.3.2.
I've checked the broadcast dictionary and it's fine, and I've already printed the sortedMovies RDD, which is fine too. I've also checked the book's online errata and found nothing.
I'm wondering if this is a syntax error due to the Python version or something like that.
from pyspark import SparkConf, SparkContext

# Build a dict of movieID -> movie name from u.item (pipe-separated)
def loadMovieNames():
    movieNames = {}
    with open("ml-100k/u.item") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

conf = SparkConf().setMaster("local").setAppName("PopularMovies")
sc = SparkContext(conf = conf)

# Broadcast the movieID -> name dict to the executors
nameDict = sc.broadcast(loadMovieNames())

# (movieID, 1) for every rating line, then sum the 1s per movieID
lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
movies = lines.map(lambda x: (int(x.split()[1]), 1))
movieCounts = movies.reduceByKey(lambda x, y: x + y)

# Flip to (count, movieID) so sortByKey orders by rating count
flipped = movieCounts.map(lambda x: (x[1], x[0]))
sortedMovies = flipped.sortByKey()

# This is the line that raises the syntax error
sortedMoviesWithNames = sortedMovies.map(lambda (count, movie) : (nameDict.value[movie], count))

results = sortedMoviesWithNames.collect()
for result in results:
    print(result)
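From what I can tell, Python 3 removed tuple parameter unpacking in function and lambda signatures (PEP 3113), so a lambda written as lambda (count, movie): ... would be a syntax error on Python 3.5 regardless of Spark. If that is indeed the cause here, a minimal sketch of an equivalent line that indexes into the tuple instead of unpacking it could look like this (my guess at a workaround, not the book's code):

# Index into the (count, movieID) pair instead of unpacking it in the lambda signature
sortedMoviesWithNames = sortedMovies.map(lambda cm: (nameDict.value[cm[1]], cm[0]))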