-2

I'm trying to scrape fundraising info using BeautifulSoup, and am running into trouble trying to isolate elements like the amount raised towards a fundraising goal.

Here is the code so far:

from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json

page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.text, 'lxml')
Amount_raised = soup.find_all('h2', class_='m-progress-meter-heading')[0].get_text()

The code works, but when I view the result, it looks like this:

print(Amount_raised)
882,521 $ raised of 1,000,000 $ goal

Ideally, I would like to have just the number '882,521' returned or, even better, parse these into two variables, one with the amount raised and another with the fundraising goal.

I feel like there should be a way to either specify which element I want or use regular expressions to isolate it, but my searches haven't been fruitful and I'm fairly new to Python.

Edit: this is the section of HTML I am trying to work with:

<h2 class="m-progress-meter-heading">882,521 $<!-- --> <span class="text-stat text-stat-title">raised of 1,000,000 $ goal</span></h2>
RJames

3 Answers

1

The easiest way I found to do this is to take the first child of the <h2>, which is the text node that comes before the <span>:

Amount_raised = soup.find_all('h2', class_='m-progress-meter-heading')
print(Amount_raised[0].contents[0])

With the HTML shown in the question, this prints 882,521 $


I found the solution in this question: "Only extracting text from this element, not its children"
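
The first text node is still a string with a currency symbol and thousands separators. As a minimal sketch, assuming whole-dollar amounts, it could be turned into an integer like this (to_int is a hypothetical helper name, not part of BeautifulSoup):

```python
import re


def to_int(text):
    """Keep only the digits of text and convert them to an int (hypothetical helper)."""
    digits = re.sub(r"\D", "", text)  # drop "$", ",", spaces and anything else non-numeric
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


# Example input: the first text node of the <h2> from the question's HTML
raised = to_int("882,521 $")
print(raised)
```

Because every non-digit is removed, this would also merge the digits of a decimal amount such as "12.50", so it only suits values without a fractional part.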
John
0

You can work with the text you already have and split it on spaces:

Amount_raised.split(" ")[0]

Full code:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.text, 'lxml')

Amount_raised = soup.find_all('h2', class_='m-progress-meter-heading')[0].get_text()
print(Amount_raised.split(" ")[0])

You can also skip the first .get_text(), find the <span> inside the <h2> and remove it with .extract(), and then call .get_text() on the <h2>:

item = soup.find_all('h2', class_='m-progress-meter-heading')[0]
item.find('span').extract()
Amount_raised = item.get_text()

Full code:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.text, 'lxml')

item = soup.find_all('h2', class_='m-progress-meter-heading')[0]
item.find('span').extract()
Amount_raised = item.get_text()
print(Amount_raised)

You can also get a list of all the strings in the <h2>; the text from the <span> is then a separate element of that list:

item = soup.find_all('h2', class_='m-progress-meter-heading')[0]
print( list(item.strings)[0] )

Full code:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.text, 'lxml')

item = soup.find_all('h2', class_='m-progress-meter-heading')[0]
print(list(item.strings)[0])

EDIT: other ways to reach the first text node:

item = soup.find_all('h2', class_='m-progress-meter-heading')[0]

print( item.next )
print( list(item.children)[0] )
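
To try these approaches without hitting the live page (whose markup may change), here is a small sketch that parses the HTML fragment from the question directly. Two things here are assumptions rather than part of the original code: the closing </h2> is added by hand, and the built-in 'html.parser' is used instead of 'lxml' so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question; the closing </h2> is an assumption
html = (
    '<h2 class="m-progress-meter-heading">882,521 $<!-- --> '
    '<span class="text-stat text-stat-title">raised of 1,000,000 $ goal</span></h2>'
)

soup = BeautifulSoup(html, "html.parser")
item = soup.find("h2", class_="m-progress-meter-heading")

if item is None:
    # find() returns None when nothing matches, e.g. if the class name changes
    raise SystemExit("heading not found")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are whitespace only
parts = list(item.stripped_strings)
raised_text = parts[0]

# The goal lives in the nested <span>
goal_text = item.find("span").get_text(strip=True)

print(raised_text)
print(goal_text)
```

Checking for None before indexing avoids the IndexError or AttributeError that find_all(...)[0] or a chained call would raise if the element is missing.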
furas
0

If you want both the goal and the amount actually raised, split the text on spaces and take the token that precedes each "$":

amts = Amount_raised.split(' ')
locs = [i for i, x in enumerate(amts) if  "$" in x]
print('Amount raised: $'+amts[locs[0]-1])
print('Goal : $'+amts[locs[1]-1])

Output:

Amount raised: $882,521
Goal : $1,000,000
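
Since the question also asks about regular expressions, here is a hedged sketch of that route. It assumes the text has the form shown in the question, with the amount raised appearing before the goal; parse_amounts is a hypothetical helper name.

```python
import re


def parse_amounts(text):
    """Return (raised, goal) as ints from text like '882,521 $ raised of 1,000,000 $ goal'."""
    # A number here is a digit followed by any run of digits and commas
    matches = re.findall(r"\d[\d,]*", text)
    if len(matches) < 2:
        raise ValueError(f"expected two amounts in {text!r}")
    raised, goal = (int(m.replace(",", "")) for m in matches[:2])
    return raised, goal


raised, goal = parse_amounts("882,521 $ raised of 1,000,000 $ goal")
print("Amount raised:", raised)
print("Goal:", goal)
```

Returning integers rather than strings makes it straightforward to compute things like the fraction of the goal reached (raised / goal).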
Jack Fleeting