I am having trouble scraping ESPN Gamecast links from the ESPN scoreboard page. I have tried:

import requests
from bs4 import BeautifulSoup

site = "https://www.espn.com/mlb/scoreboard"
html = requests.get(site).text
soup = BeautifulSoup(html, 'html.parser').find_all('a')
links = [link.get('href') for link in soup]

but the Gamecast links I am looking for are not in the result.

Stephen
  • What output are you getting? `None`? The site may be dynamic. Try printing `soup` first to check whether it contains any href at all, and also mention your desired output. – Bhavya Parikh Jun 24 '21 at 13:13
  • I get output, it just doesn't have the links I am looking for: the gameId links such as http://www.espn.com/mlb/game/_/gameId/401228181. I am not that familiar with HTML, but the link shows up in the soup like this: "links":[{"isExternal":false,"shortText":"Gamecast","rel":["summary","desktop","event"],"language":"en-US","href":"http://www.espn.com/mlb/game/_/gameId/401228181" – Stephen Jun 25 '21 at 12:01

2 Answers

1

Could it be that you left out the quotation marks? I tried the following and it produced output.

import requests
from bs4 import BeautifulSoup

site = 'https://www.espn.com/mlb/scoreboard/_/date/20210624'
html = requests.get(site).text
soup = BeautifulSoup(html, 'html.parser').find_all('a')
links = [link.get('href') for link in soup]
print(links)
  • Thank you, but that still doesn't pick up all of the game links. Only one of the links I am looking for gets picked up: "/mlb/preview?gameId=401228185". I don't understand why it doesn't recognize all of them. – Stephen Jun 25 '21 at 12:09
0

The page is loaded dynamically, so you need to either a) use something like Selenium, which lets the page render before you parse it with bs4, or b) go straight to the data source/API. The API is often the best option:

import requests

# Scoreboard API that supplies the data the page loads dynamically
api = 'http://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard'

jsonData = requests.get(api).json()
events = jsonData['events']

# Each event carries a list of links; keep only the Gamecast ones
links = []
for event in events:
    event_links = event['links']
    for each in event_links:
        if each['text'] == 'Gamecast':
            links.append(each['href'])

Output:

print(links)
['http://www.espn.com/mlb/game/_/gameId/401228229', 'http://www.espn.com/mlb/game/_/gameId/401228235', 'http://www.espn.com/mlb/game/_/gameId/401228242', 'http://www.espn.com/mlb/game/_/gameId/401228240', 'http://www.espn.com/mlb/game/_/gameId/401228233', 'http://www.espn.com/mlb/game/_/gameId/401228234', 'http://www.espn.com/mlb/game/_/gameId/401228239', 'http://www.espn.com/mlb/game/_/gameId/401228237', 'http://www.espn.com/mlb/game/_/gameId/401228231', 'http://www.espn.com/mlb/game/_/gameId/401228232', 'http://www.espn.com/mlb/game/_/gameId/401228236', 'http://www.espn.com/mlb/game/_/gameId/401228230', 'http://www.espn.com/mlb/game/_/gameId/401228238', 'http://www.espn.com/mlb/game/_/gameId/401228243', 'http://www.espn.com/mlb/game/_/gameId/401228241']
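
A slightly more defensive variant of the loop above, as a minimal sketch: the filtering is pulled into a function that works on the already-parsed JSON and uses `.get()`, so an event without a `links` list is skipped instead of raising `KeyError`. The function name `extract_gamecast_links` and the sample payload are made up for illustration; the payload only mimics the fields used in this answer (`events`, `links`, `text`, `href`) and is not real API output.

```python
def extract_gamecast_links(payload, label="Gamecast"):
    """Return the href of every link whose 'text' equals `label`."""
    links = []
    for event in payload.get("events", []):
        for link in event.get("links", []):
            # Skip links that lack the expected keys instead of raising
            if link.get("text") == label and "href" in link:
                links.append(link["href"])
    return links


# Hypothetical sample shaped like the scoreboard response (not real API output)
sample = {
    "events": [
        {"links": [
            {"text": "Gamecast",
             "href": "http://www.espn.com/mlb/game/_/gameId/401228229"},
            {"text": "Something else", "href": "https://example.com/other"},
        ]},
        {"name": "event without a links list"},
    ]
}

gamecast = extract_gamecast_links(sample)
print(gamecast)
```

With live data this would be called as `extract_gamecast_links(requests.get(api).json())`.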
chitown88
  • Thanks so much, that seems to work. Do you have any recommended resources to learn more about APIs? – Stephen Jun 30 '21 at 15:19
  • APIs will return data in JSON format. So for Python, google how to work with JSON structures (basically just understand dictionaries and lists). As for finding the URLs, you find them in the browser dev tools (Network -> XHR). ESPN is a little tricky as their APIs are “hidden”; again, a Google search helped. Best advice is just to practice scraping different sites to get a feel for it. MLB.com has some APIs you can play with. Try clicking around there with dev tools open, so you can see the URLs/requests being made. – chitown88 Jul 01 '21 at 04:24
  • does anyone know how to get the MLB schedule? – Learn2Code Dec 27 '22 at 21:42
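
To make the "dictionaries and lists" advice above concrete, here is a small sketch for exploring an unfamiliar JSON response: it walks the parsed structure recursively and yields the path to every value stored under a given key. `find_key` is a hypothetical helper written for this example, and the sample data is hand-written to resemble the `links` fragment quoted in the question, not real API output.

```python
def find_key(node, key, path=""):
    """Yield (path, value) for every occurrence of `key` in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            child = f"{path}.{k}" if path else k
            if k == key:
                yield child, v
            # Keep descending: the value may itself hold dicts or lists
            yield from find_key(v, key, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_key(item, key, f"{path}[{i}]")


# Hand-written sample resembling the fragment from the question
sample = {
    "events": [
        {"links": [{"shortText": "Gamecast",
                    "href": "http://www.espn.com/mlb/game/_/gameId/401228181"}]}
    ]
}

found = list(find_key(sample, "href"))
for where, value in found:
    print(where, "->", value)
```

The printed path tells you which indexing to use on the parsed response, here `sample['events'][0]['links'][0]['href']`.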