You don't need Selenium for this task: it slows down the scraping process, and BeautifulSoup combined with regular expressions is enough.
The movie list data is located in the page source, in inline JSON.
To extract data from the inline JSON:
- open the page source (CTRL + U);
- find the data (title, name, etc.) with CTRL + F;
- use a regular expression to extract the relevant part of the inline JSON:
# https://regex101.com/r/CqzweN/1
portion_of_script = re.findall(r'"Author:42821":\{(.*)"Article:54591":', str(all_script))
- retrieve the list of movies directly:
# https://regex101.com/r/jRgmKA/1
movie_list = re.findall(r'"titleText":"(.*?)"', str(portion_of_script))
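The two regex steps can be tried offline on a small string. This is a minimal sketch: the `"Author:42821"` and `"Article:54591"` keys are taken from the patterns above, but the sample string, its `"ImageMeta:1"` entry and the movie titles are invented for illustration and do not reflect the real page's JSON layout.

```python
import re

# Hypothetical, heavily shortened stand-in for the page's inline JSON.
# Only the "Author:42821" / "Article:54591" keys come from the answer;
# everything else here is made up for the demonstration.
inline_json = (
    '{"ImageMeta:1":{"titleText":"Site logo"},'
    '"Author:42821":{"name":"Someone","items":['
    '{"titleText":"2) Example Movie B"},'
    '{"titleText":"1) Example Movie A"}]},'
    '"Article:54591":{"titleText":"The best movies"}}'
)

# Step 1: keep only the text between the two marker keys.
portion = re.findall(r'"Author:42821":\{(.*)"Article:54591":', inline_json)

# Step 2: pull every titleText value out of that portion.
# The non-greedy (.*?) stops at the first closing quote.
titles = re.findall(r'"titleText":"(.*?)"', portion[0])

print(titles)
```

Narrowing the text first is what keeps unrelated `titleText` values (the logo and the article title in this sample) out of the result.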
We can also get the snippets and images using CSS selectors, because they are not rendered with JavaScript. You can use the SelectorGadget Chrome extension to find CSS selectors.
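As a rough illustration of those selectors, here is a sketch that runs them against a tiny hand-written HTML fragment. The class names match the selectors used in this answer, but the fragment itself and the `example.com` URLs are assumptions, not a copy of the real page. It uses the built-in `html.parser` so that `lxml` is not needed for the demo.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that only mimics the class names
# targeted by the selectors in this answer.
sample_html = """
<div class="listicle-item">
  <div class="image-container"><img data-src="//example.com/movie-b.jpg"></div>
  <div class="listicle-item-content"><p>Snippet for movie B.</p><p>Extra text.</p></div>
</div>
<div class="listicle-item">
  <div class="image-container"><img data-src="//example.com/movie-a.jpg"></div>
  <div class="listicle-item-content"><p>Snippet for movie A.</p></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# First paragraph of each item's content block.
snippets = [
    item.select_one(".listicle-item-content p:nth-child(1)").text
    for item in soup.select(".listicle-item")
]

# The image URL is read from data-src and is protocol-relative,
# so the scheme is prepended.
images = [f"https:{img['data-src']}" for img in soup.select(".image-container img")]

print(snippets)
print(images)
```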
Full code (also available in the online IDE):
from bs4 import BeautifulSoup
import requests, re, json, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

html = requests.get("https://www.empireonline.com/movies/features/best-movies-2/", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

movie_data = []
movie_snippets = []
movie_images = []

all_script = soup.select("script")

# https://regex101.com/r/CqzweN/1
# raw strings avoid invalid escape sequences such as "\:"
portion_of_script = re.findall(r'"Author:42821":\{(.*)"Article:54591":', str(all_script))

# https://regex101.com/r/jRgmKA/1
movie_list = re.findall(r'"titleText":"(.*?)"', str(portion_of_script))

# first paragraph of each list item
for snippets in soup.select(".listicle-item"):
    movie_snippets.append(snippets.select_one(".listicle-item-content p:nth-child(1)").text)

# image URLs are protocol-relative, so prepend the scheme
for image in soup.select('.image-container img'):
    movie_images.append(f"https:{image['data-src']}")

# [1:] exclude first unnecessary result
for movie, snippet, image in zip(movie_list, movie_snippets, movie_images[1:]):
    movie_data.append({
        "movie_list": movie,
        "movie_snippet": snippet,
        "movie_image": image
    })

print(json.dumps(movie_data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "movie_list": "11) Star Wars",
    "movie_snippet": "George Lucas' cocktail of fantasy, sci-fi, Western and World War II movie remains as culturally pervasive as ever. It's so mythically potent, you sense in time it could become a bona-fide religion...",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/b9f5/3ebe/477b/3f9c/e48a/11%20Star%20Wars.jpg?q=80&w=500"
  },
  {
    "movie_list": "10) Goodfellas",
    "movie_snippet": "Where Coppola embroiled us in the politics of the Mafia elite, Martin Scorsese drew us into the treacherous but seductive world of the Mob's foot soldiers. And its honesty was as impactful as its sudden outbursts of (usually Joe Pesci-instigated) violence. Not merely via Henry Hill's (Ray Liotta) narrative, but also Karen's (Lorraine Bracco) perspective: when Henry gives her a gun to hide, she admits, \"It turned me on.\"",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/ba59/5165/43e0/333b/7c6f/10%20Goodfellas.jpg?q=80&w=500"
  },
  {
    "movie_list": "9) Raiders Of The Lost Ark",
    "movie_snippet": "In '81, it must have sounded like the ultimate pitch: the creator of Star Wars teams up with the director of Jaws to make a rip-roaring, Bond-style adventure starring the guy who played Han Solo, in which the bad guys are the evillest ever (the Nazis) and the MacGuffin is a big, gold box which unleashes the power of God. It still sounds like the ultimate pitch.",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/bb13/f590/5e77/c706/49ac/9%20Raiders.jpg?q=80&w=500"
  },
  # ...
]
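The `[1:]` slice in the final loop matters because `zip` pairs items purely by position. The sketch below uses made-up lists, on the assumption (as the code comment in this answer suggests) that the image selector returns one extra, unrelated image before the movie images.

```python
# Made-up data: the image list has one extra leading entry,
# mimicking the "first unnecessary result" mentioned in the code.
titles = ["3) Movie C", "2) Movie B", "1) Movie A"]
snippets = ["Snippet C", "Snippet B", "Snippet A"]
images = [
    "https://example.com/header.jpg",  # unrelated first match
    "https://example.com/c.jpg",
    "https://example.com/b.jpg",
    "https://example.com/a.jpg",
]

# Without the slice every movie gets the previous entry's image,
# and zip silently drops the last image.
misaligned = list(zip(titles, snippets, images))

# With the slice the three lists line up again.
aligned = list(zip(titles, snippets, images[1:]))

print(misaligned[0])
print(aligned[0])
```

Because `zip` stops at the shortest input without complaint, a length mismatch like this is easy to miss. On Python 3.10+, `zip(..., strict=True)` raises a `ValueError` instead.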
If you want to know more about website scraping, there's a blog post: 13 ways to scrape any public data from any website.