
I am trying to collect data from IMDb using Python, but I can't manage to get all the reviews. The following code works but doesn't return all the available reviews:

from imdb import IMDb

ia = IMDb()

ia.get_movie_reviews('13433812') 

Output (truncated):

{'data': {'reviews': [{'content': 'Just finished watching the episode 4. Wow, it was so good. Well made mixture of thriller and comedy.I saw a few negative reviews here written after eps 1 or 2. I recommend watching at least up to eps 3 and 4. The real story starts from eps 3. Eps 4 is like a complete well made movie. You will surely enjoy it.',
  'helpful': 0,
  'title': '',
  'author': 'ur129930427',
  'date': '28 February 2021',
  'rating': None,
  'not_helpful': 0},
 {'content': 'You can see the cast had a lot of fun making this Italian/Korean would-be mafia thriller, the sort of fun NOT experienced in Hollywood since the days of Burt Reynolds. Vincenzo contains a very absorbing plot, a cast star-struck by designer clothes, interspersed with Italian (and other) Classical music excerpts to set in relief some well written suspense and intrigue. The plot centers on, if we really are to believe it, the endemically CORRUPT upper echelons of S. Korean society. Is it a coincidence that many of the systemic abuses of power and institutional vice that constitute Vincenzo\'s Main Plot are now also going on, this very moment in the USA? It is certainly food for thought. A clear advantage this Korean drama has over mediocre US shows, however is a much softer-handed use of violence, resorting more often to satire to keep the plot moving as opposed to gratuitous savagery now so common in so-called "hit" US shows. So far, so good, Binjenzo!', ...

I have also tried the following Scrapy code, but I didn't get any reviews:

import requests
from scrapy.http import TextResponse

base_url = "https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv"
r = requests.get(base_url)
response = TextResponse(r.url, body=r.text, encoding='utf-8')
reviews = response.xpath('//*[contains(@id,"1")]/p/text()').extract()
len(reviews)
Output: 0
dkw

3 Answers


This should give you all the reviewer names from that page, exhausting all the "Load More" buttons. Feel free to define other fields and fetch them according to your requirements.

import requests
from bs4 import BeautifulSoup

start_url = 'https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv'
# Endpoint the "Load More" button calls behind the scenes
link = 'https://www.imdb.com/title/tt13433812/reviews/_ajax'

params = {
    'ref_': 'undefined',
    'paginationKey': ''
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(start_url)

    while True:
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select(".review-container"):
            reviewer_name = item.select_one("span.display-name-link > a").get_text(strip=True)
            print(reviewer_name)

        # The key for the next batch is embedded in the current page;
        # when it is missing, the last batch of reviews has been reached.
        try:
            pagination_key = soup.select_one(".load-more-data[data-key]").get("data-key")
        except AttributeError:
            break
        params['paginationKey'] = pagination_key
        res = s.get(link, params=params)
SIM
  • I followed your code and I'm getting the following error : File "", line 13 except AttributeError: ^ IndentationError: unindent does not match any outer indentation level – dkw Jul 05 '21 at 09:22
  • I don't understand where this `tokenize` thing comes from. Are you using the `nltk` library and such together with my suggested code? – SIM Jul 05 '21 at 09:37
  • The copy and paste did not take into account the spacing. It is now working but it is giving me the names and not the reviews themselves. How would I go about getting the reviews? – dkw Jul 05 '21 at 14:00
  • Try this `review = item.select_one(".content > .text").get_text(strip=True)` for review. – SIM Jul 05 '21 at 14:18
  • I have it!!! Honestly, I can't thank you enough. I have been looking at code for 5 days straight. AMAZING! – dkw Jul 05 '21 at 16:23
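The `.content > .text` selector suggested in the comments can be sanity-checked against a static snippet before pointing it at the live site. This is a minimal sketch; the HTML below is a hand-written approximation of IMDb's review markup at the time, not a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one IMDb review container;
# the live markup may differ or change over time.
html = """
<div class="review-container">
  <a class="title">Great episode</a>
  <span class="display-name-link"><a href="/user/ur0000000/">someuser</a></span>
  <div class="content"><div class="text">Well made mixture of thriller and comedy.</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.select_one(".review-container")
# Same selectors as in the answer and its comments
name = item.select_one(".display-name-link > a").get_text(strip=True)
review = item.select_one(".content > .text").get_text(strip=True)
print(name, "-", review)
```

If the selectors print nothing on the real page, the markup has likely changed and the CSS paths need updating.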

Do you see the Load More button at the end of the page?

The reason you are not able to get all the reviews is that they are loaded by an AJAX request each time Load More is clicked.

You need to use Selenium to click that button and then extract the reviews.

Ram

You can also use selenium to continuously click the "load more" button until all the reviews are loaded:

from selenium import webdriver
import time, urllib.parse
from bs4 import BeautifulSoup as soup

d = webdriver.Chrome('/path/to/chromedriver')
d.get((l := 'https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv'))

# Keep clicking "Load More" until the number of loaded review containers
# reaches the total review count shown in the page header.
while int(d.execute_script("return document.querySelectorAll('#main .review-container').length")) < int(d.execute_script("return document.querySelector('.header span').textContent").split()[0]):
    d.execute_script('document.querySelector(".ipl-load-more__button").click()')
    time.sleep(3)

r = [{'score': (s := i.select_one('span.rating-other-user-rating span:nth-of-type(1)')) and s.get_text(strip=True),  # None for unrated reviews
      'title': i.select_one('a.title').get_text(strip=True),
      'reviewer_name': (j := i.select_one('.display-name-link > a')).get_text(strip=True),
      'reviewer_link': urllib.parse.urljoin(l, j['href']),
      'date': i.select_one('.display-name-link > .review-date').get_text(strip=True),
      'review': i.select_one('.content > .text').get_text(strip=True)
     }
     for i in soup(d.page_source, 'html.parser').select('#main .review-container')]
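Once `r` is built, the list of dicts can be saved with the standard library's `csv` module. A minimal sketch, using a hypothetical sample row in place of the scraped data:

```python
import csv

# Hypothetical sample row standing in for the scraped dicts in `r`
rows = [
    {'score': '9', 'title': 'Great episode', 'reviewer_name': 'someuser',
     'reviewer_link': 'https://www.imdb.com/user/ur0000000/',
     'date': '28 February 2021',
     'review': 'Well made mixture of thriller and comedy.'},
]

# DictWriter maps each dict to a CSV row using the field names as the header
with open('reviews.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```
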
Ajax1234
  • I'm getting a number of issues. I looked into this, but I had to install Homebrew to access the web driver as I kept getting the message "Message: 'chromedriver' executable needs to be in PATH." Even after installing Homebrew from my terminal, it's not connecting to my Jupyter notebook (apologies if I am using the wrong terminology). – dkw Jul 05 '21 at 09:18