I am trying to scrape data from a website using BeautifulSoup, but I get an empty list. I also tried "html.parser", but that did not help either. Please help me find a solution. Thank you very much.

My code:

from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.empireonline.com/movies/features/best-movies-2/")

movies_webpage = response.text
soup = BeautifulSoup(movies_webpage, "html.parser")
all_movies = soup.find_all(name="h3", class_="jsx-2692754980")
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)

Output:

[]
HedgeHog
webclb
  • The site's behind `JavaScript` so you won't scrape it with `bs4`. – baduker Mar 02 '21 at 10:07
  • [Data can be extracted from the page source using a regular expression](https://stackoverflow.com/questions/72171379/beautifulsoup-scraping-results-not-showing/73650133#73650133), as in my answer to a similar question. – Denis Skopa Dec 15 '22 at 12:02

2 Answers


What happens?

The response does not contain the h3 elements because the content of the website is rendered dynamically with JavaScript.
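
A quick way to check this is to count the tags in the raw HTML, before any JavaScript runs. The sketch below uses only the standard library and a made-up page: SAMPLE_HTML, TagCounter and count_tags are hypothetical names for illustration, not the real Empire markup. On the real site you would pass response.text instead.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given start tag appears in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of <tag> start tags found in the html string."""
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical page: the titles live only inside a <script> payload.
SAMPLE_HTML = """
<html><body>
<div id="root"></div>
<script>{"titleText":"2) Example Movie"}</script>
</body></html>
"""

print(count_tags(SAMPLE_HTML, "h3"))      # 0: no <h3> in the raw HTML
print(count_tags(SAMPLE_HTML, "script"))  # 1: the data sits in a <script>
```

If the count for h3 is zero in the raw response but the browser shows the headings, the elements are most likely created on the client side.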

How to fix?

You can either extract the JSON data embedded in the response, or use Selenium to load the page in a browser and get the rendered content.

Example

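from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Selenium 4 takes the driver path via a Service object;
# the executable_path argument was removed in recent versions.
service = Service(r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver = webdriver.Chrome(service=service)
url = 'https://www.empireonline.com/movies/features/best-movies-2/'
driver.get(url)

# page_source holds the DOM after the browser has run the page's JavaScript
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

all_movies = soup.find_all("h3", class_="jsx-2692754980")
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)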
HedgeHog

You do not need Selenium for this question, because it slows down the scraping process. BeautifulSoup combined with regular expressions is enough.

The movie list data is embedded in the page source as inline JSON.

To extract data from inline JSON:

  1. open the page source (CTRL + U);
  2. find the data (title, name, etc.) with CTRL + F;
  3. use a regular expression to extract the relevant part of the inline JSON:
# https://regex101.com/r/CqzweN/1
portion_of_script = re.findall(r'"Author:42821":\{(.*)"Article:54591":', str(all_script))
  4. retrieve the list of movies directly:
# https://regex101.com/r/jRgmKA/1
movie_list = re.findall(r'"titleText":"(.*?)"', str(portion_of_script))
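
The two regular expression steps can be tried offline on a small string. This is a minimal sketch: SAMPLE_SOURCE is a hypothetical excerpt (the real payload is much larger and its exact shape is an assumption here), and extract_titles is an illustrative helper, not part of any library.

```python
import re

# Hypothetical excerpt of a page source, shaped like the inline JSON described above.
SAMPLE_SOURCE = (
    '<script>{"Author:42821":{"name":"x"},'
    '"ImageMeta:1":{"titleText":"2) Example Movie A"},'
    '"ImageMeta:2":{"titleText":"1) Example Movie B"},'
    '"Article:54591":{"id":54591}}</script>'
)


def extract_titles(page_source):
    # Narrow the search to the part of the inline JSON between two known keys.
    portion = re.findall(r'"Author:42821":\{(.*)"Article:54591":', page_source)
    # Pull every "titleText" value out of that portion (non-greedy match).
    return re.findall(r'"titleText":"(.*?)"', "".join(portion))


print(extract_titles(SAMPLE_SOURCE))
# ['2) Example Movie A', '1) Example Movie B']
```

The keys "Author:42821" and "Article:54591" are specific to this article, so the first pattern returns nothing if the site changes them.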

The snippets and images can also be extracted with CSS selectors, because they are present in the static HTML and are not rendered with JavaScript. The SelectorGadget Chrome extension can help you find the right CSS selectors.

Check code in online IDE.

from bs4 import BeautifulSoup
import requests, re, json, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

html = requests.get("https://www.empireonline.com/movies/features/best-movies-2/", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

movie_data = []

movie_snippets = []

movie_images = []

all_script = soup.select("script")

# https://regex101.com/r/CqzweN/1
# raw strings avoid invalid escape sequences such as "\:"
portion_of_script = re.findall(r'"Author:42821":\{(.*)"Article:54591":', str(all_script))

# https://regex101.com/r/jRgmKA/1
movie_list = re.findall(r'"titleText":"(.*?)"', str(portion_of_script))

for snippets in soup.select(".listicle-item"):
    movie_snippets.append(snippets.select_one(".listicle-item-content p:nth-child(1)").text)
  
for image in soup.select('.image-container img'):
    movie_images.append(f"https:{image['data-src']}")
  
# [1:] skips the first image, which is not needed
for movie, snippet, image in zip(movie_list, movie_snippets, movie_images[1:]):
    movie_data.append({
        "movie_list": movie,
        "movie_snippet": snippet,
        "movie_image": image
    })
  
print(json.dumps(movie_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "movie_list": "11) Star Wars",
    "movie_snippet": "George Lucas' cocktail of fantasy, sci-fi, Western and World War II movie remains as culturally pervasive as ever. It's so mythically potent, you sense in time it could become a bona-fide religion...",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/b9f5/3ebe/477b/3f9c/e48a/11%20Star%20Wars.jpg?q=80&w=500"
  },
  {
    "movie_list": "10) Goodfellas",
    "movie_snippet": "Where Coppola embroiled us in the politics of the Mafia elite, Martin Scorsese drew us into the treacherous but seductive world of the Mob's foot soldiers. And its honesty was as impactful as its sudden outbursts of (usually Joe Pesci-instigated) violence. Not merely via Henry Hill's (Ray Liotta) narrative, but also Karen's (Lorraine Bracco) perspective: when Henry gives her a gun to hide, she admits, \"It turned me on.\"",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/ba59/5165/43e0/333b/7c6f/10%20Goodfellas.jpg?q=80&w=500"
  },
  {
    "movie_list": "9) Raiders Of The Lost Ark",
    "movie_snippet": "In '81, it must have sounded like the ultimate pitch: the creator of Star Wars teams up with the director of Jaws to make a rip-roaring, Bond-style adventure starring the guy who played Han Solo, in which the bad guys are the evillest ever (the Nazis) and the MacGuffin is a big, gold box which unleashes the power of God. It still sounds like the ultimate pitch.",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/bb13/f590/5e77/c706/49ac/9%20Raiders.jpg?q=80&w=500"
  },
  # ...
]
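
As an alternative to the regular expressions, the inline JSON can be parsed with the json module and searched by key. This is only a sketch. It assumes the script body is valid JSON on its own, which may not hold for the real page, where the payload could be wrapped in a JavaScript assignment. SAMPLE_SOURCE, find_values and titles_from_source are hypothetical names used for illustration.

```python
import json
import re

# Hypothetical page source with a JSON-only <script> body.
SAMPLE_SOURCE = (
    '<script id="data" type="application/json">'
    '{"items":[{"titleText":"2) Example Movie A"},'
    '{"nested":{"titleText":"1) Example Movie B"}}]}'
    '</script>'
)


def find_values(node, key):
    """Recursively yield every value stored under `key` in parsed JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def titles_from_source(page_source):
    # Take the body of the first <script> element and parse it as JSON.
    match = re.search(r'<script[^>]*>(.*?)</script>', page_source, re.DOTALL)
    if match is None:
        return []
    data = json.loads(match.group(1))
    return list(find_values(data, "titleText"))


print(titles_from_source(SAMPLE_SOURCE))
# ['2) Example Movie A', '1) Example Movie B']
```

Walking the parsed structure does not depend on fixed keys such as "Author:42821", but json.loads raises an error if the script body is not pure JSON.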

If you want to know more about website scraping, there is a blog post called "13 ways to scrape any public data from any website".

Denis Skopa