SOLVED
The solution is at the end of the question.
I'm writing a MapReduce job using MRJob in Python, and I have a CSV dataset. The following are a few rows from the dataset.
column headings
Year Length Title Genre Actor Actress Director Popularity Awards *Image
Row 1
1990,111,Tie Me Up! Tie Me Down!,Comedy,"Banderas, Antonio","Abril, Victoria","Almodóvar, Pedro",68,No,"NicholasCage.png,,"
Row 2
1991,113,High Heels,Comedy,"Bosé, Miguel","Abril, Victoria","Almodóvar, Pedro",68,No,"NicholasCage.png,,"
In the mapper I split each line on commas, but some columns (movie name, genre, director, actor, or actress) also contain ', ' (a comma followed by a space) inside their values, which breaks the split. As I am new to MapReduce, I can't figure out how to split each line into the correct columns.
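To make the problem concrete, here is a small sketch in plain Python (no MRJob needed) that applies the same naive split to Row 1 above. The variable names are only for illustration.

```python
# Row 1 from the dataset, copied as-is.
line = ('1990,111,Tie Me Up! Tie Me Down!,Comedy,"Banderas, Antonio",'
        '"Abril, Victoria","Almodóvar, Pedro",68,No,"NicholasCage.png,,"')

# Naive split: every comma is treated as a column separator,
# including the commas inside the quoted names.
pieces = line.split(',')

print(len(pieces))   # 15 pieces, although the row has only 10 columns
print(pieces[4])     # '"Banderas'  (first half of the actor column)
print(pieces[5])     # ' Antonio"'  (second half of the actor column)
```

Every column after the first quoted value ends up at a shifted index, so fixed positions such as "popularity is column 7" no longer hold.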
Code for finding the total number of movies in each genre, considering only movies released after 1970 with a length greater than 75 minutes:
from mrjob.job import MRJob


class MovieGenreCount(MRJob):
    def mapper(self, _, line):
        # Naive split: also splits on commas inside quoted values
        movie = line.split(',')
        year = int(movie[0])
        length = int(movie[1])
        genre = movie[3]  # columns: Year, Length, Title, Genre, ...
        if year > 1970 and length > 75:
            yield genre, 1

    def combiner(self, genre, counts):
        yield genre, sum(counts)

    def reducer(self, genre, counts):
        yield genre, sum(counts)


if __name__ == '__main__':
    MovieGenreCount.run()
SOLUTION
import csv

from mrjob.job import MRJob


class MovieGenreCount(MRJob):
    def mapper(self, _, line):
        # csv.reader respects quotes, so "Banderas, Antonio" stays one field
        row = list(csv.reader([line]))[0]
        year, length, title, genre, actor, actress, director, popularity, awards, image = row
        # Skip rows where year or length is empty
        if year and length:
            year = int(year)
            length = int(length)
            if year > 1970 and length > 75:
                yield genre, 1

    def combiner(self, genre, counts):
        yield genre, sum(counts)

    def reducer(self, genre, counts):
        yield genre, sum(counts)


if __name__ == '__main__':
    MovieGenreCount.run()
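One caveat with the solution above: it will raise an error on any line that does not have exactly 10 fields or whose year/length is not numeric (for example a header line, if the file has one). A possible way to guard against that is sketched below. `parse_movie` is a hypothetical helper name, not part of MRJob, and the 10-column layout is assumed from the column headings listed above.

```python
import csv


def parse_movie(line):
    """Return (year, length, genre) for a data row, or None to skip the line.

    Hypothetical helper; assumes the 10-column layout from the question.
    """
    # csv.reader handles quoted fields such as "Banderas, Antonio"
    row = next(csv.reader([line]), None)
    if row is None or len(row) != 10:
        return None  # blank or malformed line
    year, length, genre = row[0], row[1], row[3]
    try:
        return int(year), int(length), genre
    except ValueError:
        return None  # header line or empty/non-numeric year or length
```

Inside the mapper this could be used by calling `parse_movie(line)`, returning early when the result is `None`, and otherwise yielding `genre, 1` when `year > 1970 and length > 75`, which is the same filter as in the solution.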