
I am running a web scraping task on Investing.com's website. The code below has always worked, but it recently started giving an error for a reason I can't identify:


from bs4 import BeautifulSoup
from lxml import etree
from urllib.request import Request, urlopen
import re
import httpx

def get_investing_direct_url(url):
    while True:
        with httpx.Client() as htx:
            response = htx.get(url, headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}, timeout=(20.0, 20.0))
            bs = BeautifulSoup(response.text, 'lxml')
            percent = bs.find('span', {'class':re.compile('price_change-percent')}).text
            percent = percent.replace('(','').replace(')','')
        
        if percent:
            break
    
    return percent


SP = get_investing_direct_url('https://br.investing.com/indices/us-spx-500-futures?cid=1175153')


"I am getting the following error"

Traceback (most recent call last):
  File "c:\Users\Leon\Desktop\Whatsap Bot\123\performance_mais_index.py", line 26, in <module>
    SP = get_investing_direct_url('https://br.investing.com/indices/us-spx-500-futures?cid=1175153')
  File "c:\Users\Leon\Desktop\Whatsap Bot\123\performance_mais_index.py", line 17, in get_investing_direct_url
    percent = bs.find('span', {'class':re.compile('price_change-percent')}).text
AttributeError: 'NoneType' object has no attribute 'text'

After some headache, I was able to pull the values I need using only Selenium, as in the code below:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import re

def get_investing_direct_url(url):
    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options)

    # let the browser execute the page's JavaScript, then parse the rendered HTML
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    percent_section = soup.find('span', {'class': re.compile('price_change-percent')})

    if percent_section:
        percent = percent_section.text.replace('(', '').replace(')', '')
        driver.quit()
        return percent

    driver.quit()
    return None

SP = get_investing_direct_url('https://br.investing.com/indices/us-spx-500-futures?cid=1175153')
print("SP variation: ", SP)

Prints:

SP variation:  -0,38%


My questions: for this kind of page, is Selenium my only option? This Selenium code takes more than a minute to return the data. Is there a way to speed up the response? Can I do it without Selenium?

Could you help me with this problem? I've tried everything and I don't know what else to test.

I've already tried to pull it by span, by class, by CSS selector and others.

2 Answers


The data is stored inside the page in a <script> element (in JSON form). To load it you can use the next example:

import json
import requests
from bs4 import BeautifulSoup

def get_investing_direct_url(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # the page embeds its data as JSON inside the #__NEXT_DATA__ <script> tag
    data = json.loads(soup.select_one('#__NEXT_DATA__').text)
    # 'state' and 'indexStore' are themselves JSON strings, so decode them again
    data = json.loads(data['props']['pageProps']['state'])
    data = json.loads(data['dataStore']['indexStore'])
    # print(json.dumps(data, indent=4))

    return data['instrument']['price']['changePcr']


SP = get_investing_direct_url('https://br.investing.com/indices/us-spx-500-futures?cid=1175153')
print(SP)

Prints:

-0.31
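
If the site blocks the request or changes the page layout, soup.select_one('#__NEXT_DATA__') returns None and you hit the same AttributeError as in the question. A minimal defensive sketch, assuming the same __NEXT_DATA__ layout as above (the helper name get_change_percent is just for illustration):

import json
import requests
from bs4 import BeautifulSoup

def get_change_percent(url):
    # a browser-like User-Agent reduces the chance of being served a blocked page
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=20)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    script = soup.select_one('#__NEXT_DATA__')
    if script is None:
        # layout changed or the request was blocked
        return None

    data = json.loads(script.text)
    state = json.loads(data['props']['pageProps']['state'])
    index_store = json.loads(state['dataStore']['indexStore'])
    return index_store['instrument']['price']['changePcr']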
Andrej Kesely

You are not receiving all of the elements in this part:

response = htx.get(url, headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}, timeout=(20.0, 20.0))
bs = BeautifulSoup(response.text, 'lxml')

I think the price_change-percent element is rendered by JavaScript, so you can only get the percent price with Selenium.
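
If you stay with Selenium, one way to cut the response time is to stop waiting for the full page to finish loading and wait only for the element you need. A sketch, assuming headless Firefox, Selenium 4, and the same span class as in the question:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_percent(url):
    options = Options()
    options.add_argument('-headless')
    options.page_load_strategy = 'eager'  # return as soon as the DOM is ready
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # wait only for the target element instead of parsing the whole page source
        elem = WebDriverWait(driver, 20).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "span[class*='price_change-percent']")
            )
        )
        return elem.text.replace('(', '').replace(')', '')
    finally:
        driver.quit()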