
I'm creating a function `read_m_from_url(url, num_of_m=50)` to extract `num_of_m` movies from a URL. It should return a list of dictionaries, each of which represents a movie. Can someone tell me what I'm doing wrong on line 67 (marked with a comment)?
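
For reference, the return value is intended to look roughly like this (the keys match the ones assigned in the code below; the movie values here are just made-up placeholders):

list_movies = [
    {"title": "Inception", "year": "2010", "rank": "1",
     "genres": ["Action", "Sci-Fi"], "runtime": "148", "rating": "8.8"},
    # ... one dictionary per movie, up to num_of_m entries
]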

from bs4 import BeautifulSoup
import util_imdb as util  # read_html and process_str_with_comma are implemented in util_imdb.py

def read_m_from_url(url, num_of_m=50):
    # This function reads a number of movies from a URL. Say you set num_of_m=25: it will read 25 movies from the page. The default value is 50.
    #url = 'http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2016' #MAY NEED TO TAKE THIS OUT SINCE READ IS IN MAIN
    html_string = util.read_html(url) # given a url, you need to read the html file as a string. I have implemented this read_html function in util_imdb.py. Please take a look
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    # Fetch the table that includes all the movies. In our lecture, we talked about the find and find_all functions.
    # For example, find_all('table') will give you all tables on the page. find and find_all can take two parameters:
    # in the code below, 'table' is the tag name and 'results' is an attribute value of the tag. You can also write
    # movie_table = soup.find('table', {'class': 'results'}), which explicitly says: I want to find a table with
    # attribute class = 'results'. Since on each IMDb page there's only one table with class = 'results', we can use
    # find rather than find_all. find_all will return a list of table tags, while find() will return only one table.
    movie_table = soup.find('table', attrs = {'class': "results"}) # equivalent to movie_table = soup.find('table', {'class': 'results'})
    tables = movie_table[0] #line 67. create list for tables
    tb = tables.find_all('movie_table')[0]
    trs = tb.find_all('tr')
    list_movies = [] # initialize the return value, a list of movies
    # Use count to track the number of movies processed. It starts at 0: no movie has been processed yet.
    count = 0 #increase count by 1 for every movie processed
    # Add your code here, based on the following pseudocode.

    for tr in trs: # each row represents information of a movie
      dict_each_movie = {} # create an empty dictionary

      # your code to fetch title first.
      title = tr.findChildren('td', attrs= {'class': "title"})
      title = title.encode("ascii", "ignore") # convert the unicode string into an ascii string
      util.process_str_with_comma(title) # this method is in util_imdb.py.
      # Sometimes a title can include a comma (e.g. "Oh, My God!"). This will cause a problem
      # if your code outputs the title to a csv file. To deal with this problem, we use quotation
      # marks to enclose the title. When you load the csv using the python package pandas
      # (or SAS or many other packages for processing csv), and pandas sees a string enclosed in "",
      # it will recognize it as a csv field with commas inside it.
      dict_each_movie["title"] = title

      # your code to fetch year
      year = tr.findchildren('td', attrs= {'class': "year_type"})#tag is 'a' on page source.
      year = year.encode("ascii","ignore")
      dict_each_movie["rank"] = rank

      # your code to fetch rank. Rank here means the number (such as 1., 2.) in front of each movie's image. Remove the '.'
      dotted_rank = tr.findChildren('td', attrs = {'class': "number"})
      rank = dotted_rank.replace(".", "") #takes out period at end
      rank = rank.encode("ascii","ignore")
      dict_each_movie["year"] = year

      # your code to fetch genres. Here I used try except; you can implement this part in a different way without using exception handling
      genres = [] # a movie can have a list of or none genre values
      try: # you need to deal with exception, since a movie may not have a tag for genres. If there are genres:
          genre = tr.findChildren('td', attrs = {'class': "genre"})
          genre = genre.encode("ascii", "ignore")
          genres.append(genre)
          # find_all genres: add all the genres to the list "genres". Remember to first encode('ascii', 'ignore') and then append to the list.
      except:
          genres = []
          "do nothing. genres is still [], an empty list"
      finally: # whether an exception or not, you want to do the following
            dict_each_movie["genres"] = genres

      # your code to fetch runtime. Again, there are some movies that do not have a runtime value
      runtime = ""
      try:
           runtime = tr.findchildren('td', attrs = {'class': "runtime"})#find runtime
           runtime = runtime.encode('ascii','ignore')
           runtime.remove('mins.')#a runtime string looks like "90 mins." you need to remove " mins."
      except:
           runtime = "" #do nothing
      finally:
           dict_each_movie["runtime"] = runtime

      #your code to fetch rating
      rating = tr.findChildren('td', attrs = {'class': "rating-rating"})
      rating = rating.runtime.encode('ascii','ignore')
      dict_each_movie["rating"] = rating

      list_movies.append(dict_each_movie)
      count += 1
      if count == num_of_m:
          break
      # Now we are done with processing a movie: increment count and
      # check if we have processed num_of_m movies (count == num_of_m); if so, break.

    return list_movies

def test_read_m_from_url():

    url = "http://www.imdb.com/search/title?at=0&sort=user_rating&start=51&title_type=feature&year=2005,2014"
    print read_m_from_url(url, 21)
  • line 67, in read_m_from_url tables = movie_table[0] #create list for tables – squidvision Feb 15 '16 at 01:15
  • line 203, in main() line 198, in main test_read_m_from_url() line 141, in test_read_m_from_url print read_m_from_url(url, 21) line 67, in read_m_from_url tables = movie_table[0] #create list for tables line 958, in __getitem__ return self.attrs[key] KeyError: 0 – squidvision Feb 15 '16 at 01:16
  • `soup.find` should be changed to `soup.find_all`, and if `tables` is a list it doesn't have a `find_all` attribute – ᴀʀᴍᴀɴ Feb 15 '16 at 01:22
  • @Arman So for that portion, should it be: movie_table = soup.find_all('table', attrs = {'class': "results"}) tables = movie_table[0] trs = tables.find_all('tr') I'm not getting any errors in that portion, so I'm guessing I fixed it....??? – squidvision Feb 15 '16 at 01:35
  • Run edited code and report errors here and edit your question – ᴀʀᴍᴀɴ Feb 15 '16 at 01:36
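
To make the point from the comment thread concrete, here is a minimal standalone sketch (not a fix for the full script) of why indexing the result of find() fails with KeyError: 0, while find_all() returns something indexable. The HTML snippet is made up purely for illustration.

from bs4 import BeautifulSoup

html = '<table class="results"><tr><td>row</td></tr></table>'  # made-up snippet
soup = BeautifulSoup(html, "html.parser")

tag = soup.find('table', attrs={'class': 'results'})       # a single Tag object
# tag[0] raises KeyError: 0 -- indexing a Tag looks up an HTML attribute
# (e.g. tag['class'] works), not a position in a list.

tags = soup.find_all('table', attrs={'class': 'results'})  # a list-like ResultSet
first = tags[0]                                             # indexing by position works here
rows = first.find_all('tr')                                 # a Tag object does support find_all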

0 Answers