I'm following a course and I'm stuck on one example because the website's content and tags have changed since the course was made. The tag no longer looks the way it does in the course, and even when I change the class to the current one, nothing is returned. I'd like to scrape the movie titles and images. (A screenshot of the page's HTML was attached here.)

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.empireonline.com/movies/features/best-movies-2')
best_movies = response.text

soup = BeautifulSoup(best_movies, 'html.parser')

titles = soup.find_all(name = 'h3', class_ = 'jsx-2692754980')
print(titles)

imgs = soup.find_all(name='img', class_='jsx-4015086601')
print(imgs)
imgs_li=[]
for e in imgs:
    link = e.get('src')
    imgs_li.append(link)
print(imgs_li)

3 Answers

You might want to use Selenium together with BeautifulSoup: Selenium loads the page in a real browser, and BeautifulSoup then parses the rendered HTML.

Here's how:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

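options = Options()
# Recent Selenium 4 releases no longer support `options.headless`; pass the flag instead.
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://www.empireonline.com/movies/features/best-movies-2")
images = BeautifulSoup(driver.page_source, "html.parser").find_all("img")
driver.quit()  # close the browser once the HTML has been captured

movies = []
for image in images:
    # Keep only images that have both an alt text and a lazy-load URL.
    alt, data_src = image.get("alt"), image.get("data-src")
    if alt and data_src:
        movies.append([alt, f"https:{data_src}"])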

for movie in movies[1:]:
    title, link = movie
    print(f"{title}\n{link}\n{'-' * 80}")

Output:

Stand By Me
https://cdn.onebauer.media/one/media/5e62/24d4/08ba/aa5a/8143/279c/stand-by-me.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
Raging Bull
https://cdn.onebauer.media/one/media/5d2d/d990/853e/7cd6/60cc/fa2e/raging-bull.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
Amelie
https://cdn.onebauer.media/one/empire-images/features/59395a49f68e659c7aa3a1a8/Amelie.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
Leonardo DiCaprio and Kate Winslet in Titanic
https://cdn.onebauer.media/one/lifestyle-images/celebrity/59d4ac2c07c78ace382c4735/kate-winslet-leonardo-dicaprio-titanic.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
Good Will Hunting
https://cdn.onebauer.media/one/media/5e62/2a32/2cd5/547b/bf0f/6416/good-will-hunting.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
Arrival
https://cdn.onebauer.media/one/media/5e62/2ac7/2eea/4450/3534/4b45/Arrival.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
Lost In Translation
https://cdn.onebauer.media/one/media/5e62/2b5f/232f/f064/694b/c738/lost-in-translation.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
The Princess Bride
https://cdn.onebauer.media/one/media/5e62/2bf3/08ba/aa7b/8f43/27e0/the-princess-bride.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------
The Terminator
https://cdn.onebauer.media/one/empire-images/features/59395a49f68e659c7aa3a1a8/The%2520Terminator.jpg?format=jpg&quality=80&width=500&ratio=1-1&resize=aspectfit
--------------------------------------------------------------------------------

and so on ...
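
The data-src values on this page are protocol-relative (they begin with //), which is why the code above prepends https: by hand. As a minimal sketch, assuming nothing about the page beyond that URL shape, urljoin from the standard library can do the same job while leaving already-absolute URLs untouched. The helper name absolute_image_url is made up for illustration.

```python
from urllib.parse import urljoin

# Page the images were scraped from; used as the base for relative URLs.
PAGE_URL = "https://www.empireonline.com/movies/features/best-movies-2"


def absolute_image_url(raw_src, page_url=PAGE_URL):
    """Resolve a protocol-relative or relative image URL against the page URL.

    Hypothetical helper: "//host/path" inherits the scheme of page_url,
    and a URL that is already absolute is returned unchanged.
    """
    return urljoin(page_url, raw_src)


if __name__ == "__main__":
    print(absolute_image_url("//cdn.onebauer.media/one/media/5e62/24d4/08ba/aa5a/8143/279c/stand-by-me.jpg"))
```

With this in place, the append line could call absolute_image_url(data_src) instead of building the string with an f-string.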
baduker

This website is rendered dynamically, so requests plus bs4 alone will not work here (see the page source). I recommend using Selenium to grab the rendered page source and passing it to a BeautifulSoup object. Here is some sample code to do this:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.empireonline.com/movies/features/best-movies-2'

chrome_options = Options()
chrome_options.add_argument('--headless')

# webdriver_manager downloads a matching chromedriver and returns its path.
service = Service(ChromeDriverManager().install())

# Pass the options so the browser really runs headless (Selenium 4 style).
# The with block quits the browser on exit, so no explicit close() is needed.
with webdriver.Chrome(service=service, options=chrome_options) as driver:
    # Retrieve url in headless browser
    driver.get(url)

    html = driver.page_source



soup = BeautifulSoup(html, 'html.parser')

titles = soup.find_all(name='h3', class_='jsx-2692754980')
titles = [i.text for i in titles]  # .text is always a string, so no None check is needed
print(titles)

imgs = soup.find('div', class_='jsx-3821216435').find_all('img')
print(imgs)

The results for titles and imgs are:

titles -- ['100) Stand By Me', '99) Raging Bull', '98) Amelie', '97) Titanic', '96) Good Will Hunting', '95) Arrival', '94) Lost In Translation' ... ]

imgs --- [<img alt="Stand By Me" class="jsx-952983560 loading" data-src="//cdn.onebauer.media/one/media/5e62/24d4/08ba/aa5a/8143/279c/stand-by-me.jpg?format=jpg&amp;quality=80&amp;width=500&amp;ratio=1-1&amp;resize=aspectfit" src="" title=""/>, <img alt="Raging Bull" class="jsx-952983560 loading" data-src="//cdn.onebauer.media/one/media/5d2d/d990/853e/7cd6/60cc/fa2e/raging-bull.jpg?format=jpg&amp;quality=80&amp;width=500&amp;ratio=1-1&amp;resize=aspectfit" src="" title=""/>, ... ]

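Note that you need to pip install selenium, bs4 and webdriver-manager. Because the code uses ChromeDriverManager, chromedriver is downloaded automatically; without it, you would have to download chromedriver yourself and put it in the same directory as the script.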
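
In the imgs output above, every src attribute is empty and the real URL sits in data-src, so reading src (as the question's code does) yields only empty strings. The sketch below illustrates preferring data-src and falling back to src. To keep it runnable without a browser, it uses the standard library's HTMLParser on a one-tag sample instead of BeautifulSoup; LazyImageCollector and SAMPLE_HTML are hypothetical names, and the sample tag is copied from the output above with its query string dropped.

```python
from html.parser import HTMLParser


class LazyImageCollector(HTMLParser):
    """Collect (alt, url) pairs from <img> tags, preferring data-src over src."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Lazy-loaded images keep the real URL in data-src and leave src empty.
        url = attributes.get("data-src") or attributes.get("src")
        alt = attributes.get("alt")
        if alt and url:
            self.images.append((alt, url))


# Sample tag based on the imgs output above (assumption: query string removed).
SAMPLE_HTML = (
    '<img alt="Stand By Me" class="jsx-952983560 loading" '
    'data-src="//cdn.onebauer.media/one/media/5e62/24d4/08ba/aa5a/8143/279c/stand-by-me.jpg" '
    'src="" title=""/>'
)

collector = LazyImageCollector()
collector.feed(SAMPLE_HTML)
print(collector.images)
```

The same preference order carries over to BeautifulSoup: img.get('data-src') or img.get('src').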

Rustam Garayev

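Given that Selenium might be a little advanced, you could stick with what you already know. How about string tokenisation? The titles are already present in the raw page text, so you can split the string to pull them out: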

import requests
from bs4 import BeautifulSoup


url = "https://www.empireonline.com/movies/features/best-movies-2/"
movies_page = requests.get(url)
page = movies_page.text

soup = BeautifulSoup(page, 'html.parser')

long_string = str(soup).split('"title":"100 Greatest Movies 2021"')[1]
parts = long_string.split('","altText":"')[:101]
with open('top_100_movies.txt', 'a') as file:
    for part in parts[::-1]:
        file.write(part.split(',"titleText":"')[1])
        file.write('\n')

This way you can still achieve the required result without Selenium.
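
The splitting above relies on each title being stored as a "titleText":"..." field in the raw page text. As a hedged alternative sketch, a regular expression can pull those fields out in one pass. SAMPLE_TEXT below is a made-up fragment that only mimics that field layout (the real page text is much longer and may differ), and extract_titles is a hypothetical helper name.

```python
import re

# Made-up fragment imitating the "titleText" / "altText" fields the answer splits on.
SAMPLE_TEXT = (
    '{"titleText":"100) Stand By Me","altText":"Stand By Me"},'
    '{"titleText":"99) Raging Bull","altText":"Raging Bull"}'
)


def extract_titles(page_text):
    """Return every titleText value in the order it appears in page_text.

    Assumption: titles contain no escaped double quotes; a title with an
    escaped quote inside it would be cut short by this simple pattern.
    """
    return re.findall(r'"titleText":"(.*?)"', page_text)


titles = extract_titles(SAMPLE_TEXT)
# Reverse so the list runs from the lowest number upwards, as the answer's loop does.
for title in reversed(titles):
    print(title)
```

On the real page, the input would be movies_page.text from the code above rather than SAMPLE_TEXT.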