Urllib.request not parsing correctly

Question

I'm trying to scrape some numbers from the TD Ameritrade website with urllib.request and Beautiful Soup, but I suspect the website has some mechanism that swaps in incorrect numbers to prevent web scraping. For example, when I parse the next earnings date from the URL 'https://research.tdameritrade.com/grid/public/research/stocks/earnings?symbol=goog', I get "(Unconfirmed) July 25, 2022", whereas the earnings date shown in the website's HTML is "July 26, 2022".

Is this true, or is there something just wrong with my code? Is there any way to get around this?

from urllib.request import Request,urlopen
from bs4 import BeautifulSoup as soup

url = 'https://research.tdameritrade.com/grid/public/research/stocks/earnings?symbol=goog'

request_site = Request(url)
page_html = urlopen(request_site).read()
page_soup = soup(page_html, "html.parser")


earnings = page_soup.findAll("td", {"class": "value week-of"})

earnings = earnings[0].text
print(earnings)

Does this answer your question? [Why is HTML returned by requests different from the real page HTML?](https://stackoverflow.com/questions/65186906/why-is-html-returned-by-requests-different-from-the-real-page-html) — Alexander, Jul 15 '22 at 00:45

score 0 · Answer 1 · answered Aug 29 '22 at 09:12

Have a look at the SelectorGadget Chrome extension, which lets you find a CSS selector by clicking the desired HTML element in the browser.

The CSS selector we need is .week-of:

earnings = soup.select_one(".week-of").text

# To extract the date without the "(Unconfirmed)" prefix (requires `import re`):
# https://regex101.com/r/wMwdYw/1
earnings = re.search(r"\w+\s?\d+,\s?\d{4}", soup.select_one(".week-of").text).group(0)

Keep in mind that this selector may change once estimated earnings data is present. If the .week-of element is removed or renamed, select_one() returns None, and accessing .text on it raises AttributeError: 'NoneType' object has no attribute 'text'.
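One way to guard against the missing-element case described above is to treat both "no element" and "no date in the text" as None instead of letting an exception propagate. This is only a minimal sketch: `parse_earnings_date` is a hypothetical helper (not part of bs4 or requests), it reuses the regex from the snippet above, and it assumes English month names in the "Month D, YYYY" form.

```python
import re
from datetime import date, datetime
from typing import Optional

# Same pattern as in the answer: a word, a day number, a comma, a four-digit year.
DATE_PATTERN = re.compile(r"\w+\s?\d+,\s?\d{4}")


def parse_earnings_date(raw: Optional[str]) -> Optional[date]:
    """Return the date found in `raw`, or None if there is nothing to parse.

    Hypothetical helper for illustration only.
    """
    if raw is None:        # element was missing, e.g. select_one() returned None
        return None
    match = DATE_PATTERN.search(raw)
    if match is None:      # text is present but holds no recognizable date
        return None
    # Normalize the spacing after the comma; assumes English month names.
    cleaned = re.sub(r",\s*", ", ", match.group(0))
    try:
        return datetime.strptime(cleaned, "%B %d, %Y").date()
    except ValueError:     # matched the pattern but is not a real date
        return None


print(parse_earnings_date("(Unconfirmed) October 25, 2022"))  # 2022-10-25
print(parse_earnings_date(None))                              # None
```

With the code in this answer, it could be called as `node = soup.select_one(".week-of")` followed by `parse_earnings_date(node.text if node else None)`.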


from bs4 import BeautifulSoup
import requests  # the "lxml" parser must be installed (pip install lxml) but does not need to be imported

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "symbol": "goog"    # ticker symbol
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

html = requests.get("https://research.tdameritrade.com/grid/public/research/stocks/earnings", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

earnings = soup.select_one(".week-of").text

print(earnings)

Example output

(Unconfirmed) October 25, 2022
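For reference, the `params` dictionary passed to `requests.get` ends up as the same `?symbol=goog` query string that the question hard-codes in its URL. A small standard-library sketch of that encoding (the `BASE_URL` and `full_url` names are only for illustration):

```python
from urllib.parse import urlencode

BASE_URL = "https://research.tdameritrade.com/grid/public/research/stocks/earnings"
params = {"symbol": "goog"}  # ticker symbol, as in the answer

# Build the query string by hand to show what the params dict turns into.
full_url = f"{BASE_URL}?{urlencode(params)}"
print(full_url)
```

This should print the URL from the question, so the same dictionary-based approach can also be used with urllib.request if you prefer to stay with the standard library.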
