Splitting comma-separated data in Python

Question

SOLVED

The solution is at the end of the question.

I'm writing a MapReduce job using MRJob in Python, and I have a CSV dataset. Here are a few rows from the dataset.

Column headings

Year Length Title Genre Actor Actress Director Popularity Awards *Image

Row 1

1990,111,Tie Me Up! Tie Me Down!,Comedy,"Banderas, Antonio","Abril, Victoria","Almodóvar, Pedro",68,No,"NicholasCage.png,,"

Row 2

1991,113,High Heels,Comedy,"Bosé, Miguel","Abril, Victoria","Almodóvar, Pedro",68,No,"NicholasCage.png,,"

In the mapper I am splitting each line on commas, but some columns, such as the movie name, genre, director, actor, or actress, also contain ', ' (a comma followed by a space) inside the value, which breaks the split. I am new to MapReduce and can't work out how to split each row cleanly into its columns.
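To make the problem concrete, here is a minimal sketch that applies a plain `str.split(',')` to Row 1 from above. The variable names `row1` and `naive` are only for illustration and are not part of the job.

```python
# Row 1 from the dataset, copied verbatim.
row1 = ('1990,111,Tie Me Up! Tie Me Down!,Comedy,"Banderas, Antonio",'
        '"Abril, Victoria","Almodóvar, Pedro",68,No,"NicholasCage.png,,"')

# A plain split treats every comma as a separator, including the ones
# inside the double-quoted fields.
naive = row1.split(',')

print(len(naive))   # 15 pieces instead of the 10 columns
print(naive[4:6])   # the actor name is cut in two: '"Banderas' and ' Antonio"'
```

The quoted fields are exactly where the split goes wrong, so whatever parser is used has to understand CSV quoting rather than just look for commas.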

The code below finds the total number of movies in each genre, considering only movies released after 1970 with a length greater than 75 minutes.


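from mrjob.job import MRJob

class MovieGenreCount(MRJob):

    def mapper(self, _, line):
        # Naive split: this also breaks on commas inside quoted fields
        # such as "Banderas, Antonio", so the columns can shift.
        movie = line.split(',')
        year = int(movie[0])
        length = int(movie[1])
        # Column order is Year, Length, Title, Genre, so genre is index 3.
        genre = movie[3]
        if year > 1970 and length > 75:
            yield genre, 1

    def combiner(self, genre, counts):
        yield genre, sum(counts)

    def reducer(self, genre, counts):
        yield genre, sum(counts)

if __name__ == '__main__':
    MovieGenreCount.run()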

SOLUTION

import csv

from mrjob.job import MRJob

class MovieGenreCount(MRJob):

    def mapper(self, _, line):
        # csv.reader respects double quotes, so "Banderas, Antonio" stays one field.
        row = list(csv.reader([line]))[0]
        year, length, title, genre, actor, actress, director, popularity, awards, image = row
        # Skip rows where year or length is empty; int('') raises ValueError.
        if year and length:
            year = int(year)
            length = int(length)
            if year > 1970 and length > 75:
                yield genre, 1

    def combiner(self, genre, counts):
        yield genre, sum(counts)

    def reducer(self, genre, counts):
        yield genre, sum(counts)

if __name__ == '__main__':
    MovieGenreCount.run()
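For anyone who wants to check the parsing step outside of MRJob, here is a small standalone sketch of the same `csv.reader` call applied to Row 1. The helper name `parse_csv_line` is made up for this example and does not appear in the job above.

```python
import csv

def parse_csv_line(line):
    """Parse a single CSV-formatted line into a list of string fields."""
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line]))

sample = ('1990,111,Tie Me Up! Tie Me Down!,Comedy,"Banderas, Antonio",'
          '"Abril, Victoria","Almodóvar, Pedro",68,No,"NicholasCage.png,,"')

fields = parse_csv_line(sample)
print(len(fields))  # 10, one entry per column
print(fields[4])    # Banderas, Antonio
```

Note that every field comes back as a string, so numeric columns such as year and length still need an explicit `int()` conversion.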

@Barmar I'm not particularly familiar with the csv module, but I did try it this way: `row = list(csv.reader([line]))[0]` followed by `year, length, title, genre, actor, actress, director, popularity, awards, image = row`. It converts all values to strings, and even when I tried casting year and length with int() the code gives the following error: ValueError: invalid literal for int() with base 10: '' - hadi khan, Mar 22 '23 at 22:04
@Barmar there are empty values in length, but how should I handle them? - hadi khan, Mar 22 '23 at 22:10
It worked. Rather than putting in 0s, I only applied the check and didn't change or insert any values. - hadi khan, Mar 22 '23 at 22:13
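One possible way to handle the empty values discussed in these comments is a small conversion helper, sketched below. The name `to_int_or_none` is hypothetical, and returning `None` for anything non-numeric (which would also cover a header row, if the file has one) is an assumption about how such rows should be treated.

```python
def to_int_or_none(value):
    """Return int(value), or None if the field is empty or not a number."""
    value = value.strip()
    if not value:
        return None  # empty field, e.g. a missing length
    try:
        return int(value)
    except ValueError:
        return None  # non-numeric text, e.g. a header cell like 'Length'

print(to_int_or_none('111'))     # 111
print(to_int_or_none(''))        # None
print(to_int_or_none('Length'))  # None
```

In the mapper, a row would then be skipped whenever either converted value is `None`, which has the same effect as the `if year and length:` check for empty fields.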
