Having Trouble Webscraping Cronometer.com using BeautifulSoup

Question

I'm very new to Python, but using a few different online guides I've managed to stitch together some code that logs me into a website called cronometer.com (health tracking website/app, similar to myfitnesspal). Unfortunately, I'm having trouble actually scraping any data.

I have the following code (ignore the Hass/AppDaemon parts; I'm running this Python script in Home Assistant):

import appdaemon.plugins.hass.hassapi as hass
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import requests

class Scraper(hass.Hass):

  def initialize(self):
    self.log("Scraper Initialized")
    self.get_values(self)

  def get_values(self,kwargs):
    self.login_url = "https://cronometer.com/login/"
    self.r = requests.get(self.login_url)
    self.bs = BeautifulSoup(self.r.text, 'html.parser')
    self.csrf_token = self.bs.find('input', attrs={'name': 'anticsrf'})['value']
    self.url = "https://cronometer.com/"
    self.session = requests.Session()
    self.payload = {
        "username": "MY_USERNAME",
        "password": "MY_PASSWORD",
        "anticsrf": self.csrf_token
    }
    self.headers = {'referer': self.login_url, 'User-agent': 'Chrome'}
    self.sensorname = "sensor.scraper"
    self.friendly_name = "Fasting Status"
    
    try:
      s = self.session.post(self.login_url, data=self.payload, headers=self.headers, cookies=self.r.cookies)
    except:
      self.log("Could not log in")
      return
    
    self.log(self.csrf_token)
    s = self.session.get(self.url)
    page = s.content
    soup = BeautifulSoup(page, "html.parser")

    # Test 1
    fasting1 = soup.select('#cronometerApp > div:nth-child(2) > div:nth-child(1) > div > table > tbody > tr > td:nth-child(1) > div > div:nth-child(8) > div > div.diary-item-title > div')
    self.log("TEST 1")
    self.log(fasting1)

    # Test 2
    fasting2 = soup.select('#cronometerApp > div:nth-child(2) > div:nth-child(1) > div > table > tbody > tr > td:nth-child(1) > div > div:nth-child(8) > div > div.diary-item-content > div.GJES3IWDERB')
    self.log("TEST 2")
    self.log(fasting2)

    # Test 3
    fasting3 = soup.select('#w-node-dd7aab6f-acfc-dfa1-2372-313b5d39fc2b-0dd15747 > div.div__mobile__features-text-1 > h5')
    self.log("TEST 3")
    self.log(fasting3)

    # Test 4
    fasting4 = soup.select('#cronometerApp > div:nth-child(2) > div:nth-child(1) > div > table > tbody > tr > td:nth-child(2) > div > div.GJES3IWDHFD > button:nth-child(1) > span')
    self.log("TEST 4")
    self.log(fasting4)

    # Test 5
    fasting5 = soup.select('#cronometerApp > div:nth-child(2) > div:nth-child(1) > div > table > tbody > tr > td:nth-child(2) > div > div.diary_side_box.GJES3IWDIQB > div.GJES3IWDKQB > div > div.GJES3IWDITE > table > tbody > tr > td > div:nth-child(1) > span')
    self.log("TEST 5")
    self.log(fasting5)

    self.set_state(self.sensorname, state= "Test", attributes = {"friendly_name": self.friendly_name})

From what I can tell, this code logs into cronometer.com without any issues. The problem (I think) is that the URL of my personal homepage is the same as the URL of the site before logging in. So after using session.post to send my credentials, I use session.get to scrape data from my "profile", but it only returns the normal cronometer.com page (the one you see before logging in), not my personal page at the same URL.

One thing I did notice is that the URL changes slightly when I click on the tabs at the top.

When I click on Diary, the URL changes from cronometer.com to cronometer.com/#diary, and Trends is cronometer.com/#trends, so on and so forth. But using those specific URLs is not proving fruitful either.
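A side note on why those URLs behave the same: the part after `#` is a URL fragment, which the browser keeps for itself and HTTP clients generally do not send to the server. The sketch below uses only the standard library to show that all three URLs reduce to the same request target. The helper name `request_target` is made up for illustration.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the part of the URL a client puts on the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # parts.fragment is deliberately left out: it never leaves the client
    return target


for url in (
    "https://cronometer.com/",
    "https://cronometer.com/#diary",
    "https://cronometer.com/#trends",
):
    print(url, "->", request_target(url), "fragment:", repr(urlsplit(url).fragment))
```

So requesting cronometer.com/#diary asks the server for the same resource as cronometer.com/; the tab switch happens in JavaScript inside the browser.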

Again, sorry for my lack of knowledge, but how can I overcome this issue? I've tried looking at some online guides about Selenium, but so far I haven't been able to make sense of how I could use Selenium to log in when the issue isn't necessarily logging in (I don't think), but scraping the right webpage. Thanks in advance for your help.

You could try using `session` for the login request as well. – Martin Evans, Jun 30 '21 at 08:22
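A minimal sketch of what this comment suggests, assuming the form field names from the question (`username`, `password`, `anticsrf`): fetch the login page and submit the form through one session object, so any cookies set by the first response are sent along with the second request. The `login` helper and `CsrfTokenParser` are hypothetical names, and the standard library's `html.parser` stands in for BeautifulSoup only to keep the sketch dependency-free. Whether Cronometer accepts a login this way is not verified here.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collects the value of <input name="anticsrf"> from a login page."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "anticsrf":
            self.token = attrs.get("value")


def login(session, login_url, username, password):
    """Load the login form and post the credentials via the same session."""
    # GET through the session, so its cookie jar keeps whatever the site sets
    form_page = session.get(login_url)

    parser = CsrfTokenParser()
    parser.feed(form_page.text)

    payload = {
        "username": username,
        "password": password,
        "anticsrf": parser.token,
    }
    # POST through the same session: stored cookies are sent automatically
    return session.post(login_url, data=payload, headers={"referer": login_url})
```

With requests this would be called as `login(requests.Session(), "https://cronometer.com/login/", "MY_USERNAME", "MY_PASSWORD")`; any object that exposes `get` and `post` the same way works too.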

score 2 · Answer 1 · answered Mar 02 '22 at 14:39

You are using the requests module, which is a great tool for scraping static or server-side-rendered content.

Cronometer, however, is a JavaScript app. If you disable JavaScript and try to load Cronometer, you will be greeted by the "Your web browser must have javascript enabled" message.

This is basically what your requests call sees as well: requests downloads the HTML but never runs the JavaScript that builds the page.
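To make that concrete, here is a small sketch of what a static parser finds in such a page. The markup is hypothetical, made up for illustration (only the `cronometerApp` id and the message text come from this thread), and the standard library's `html.parser` is used so the example runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical page shell, NOT Cronometer's real markup
STATIC_SHELL = """
<html>
  <body>
    <noscript>Your web browser must have javascript enabled</noscript>
    <div id="cronometerApp"></div>
    <script src="app.js"></script>
  </body>
</html>
"""


class TextCollector(HTMLParser):
    """Gathers the text nodes of a document, skipping <script> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._in_script:
            self.chunks.append(text)


collector = TextCollector()
collector.feed(STATIC_SHELL)
print(collector.chunks)
```

The app container is empty until a browser runs the script, so selectors that reach inside `#cronometerApp` have nothing to match in the downloaded HTML.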

Scraping sites like this is easy with tools such as the requests-html module or Selenium.

I personally like Selenium because it is extremely easy to use, and you can watch what the script is doing in a Chrome browser in real time.

I wrote a piece of code that logs into Cronometer and scrapes the daily energy value. I've added comments explaining what every line does.

import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
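# set HEADLESS to True to run Chrome without a visible window
# (the flag is passed directly because newer Selenium releases dropped options.headless)
HEADLESS = False
if HEADLESS:
    options.add_argument('--headless')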
driver = webdriver.Chrome(options=options)

URL = 'https://cronometer.com/login/'
USERNAME = ''
PASSWORD = ''

# navigate to cronometer
driver.get(URL)

# fill inputs
driver.find_element(by=By.NAME, value='username').send_keys(USERNAME)
driver.find_element(by=By.NAME, value='password').send_keys(PASSWORD)

# click on the login button
driver.find_element(by=By.ID, value='login-button').click()

# wait up to 10 seconds for the daily energy bar to load;
# until() raises TimeoutException if it has not appeared by then
timeout = 10
expectation = EC.element_to_be_clickable((By.CSS_SELECTOR, '.nutrientTargetBar-text'))
nutrients_element = WebDriverWait(driver, timeout).until(expectation)

# print daily energy bar text
print(nutrients_element.text)

# close the browser and stop the chromedriver process
driver.quit()
