
I am attempting to scrape data from NBA.com using Python, but when I run my code (shown below), the request never returns a response, even after waiting a reasonable amount of time.

import requests
import json

url_front = 'http://stats.nba.com/stats/leaguedashplayerstats?College=&' + \
            'Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&' + \
            'DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&' + \
            'Location=&MeasureType='
url_back = '&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&' + \
           'PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&' + \
           'PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&' + \
           'SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&' + \
           'VsConference=&VsDivision=&Weight='
#measure_type = ['Base','Advanced','Misc','Scoring','Opponent','Usage','Defense']
measure_type = 'Base'
address = url_front + measure_type + url_back

# Request the URL, then parse the JSON.
response = requests.get(address)
response.raise_for_status()         # Raise an exception for an invalid response.
data = response.json()              # Decode the JSON payload.

So far, I have attempted to reproduce code from blog posts (here) and from similar questions posted on this site (Python, R), but I end up with the same result each time: the code never succeeds in pulling anything from the URL.

Since I am new to web scraping, I would appreciate help troubleshooting this: is this behavior common for sites with client-side rendering (like NBA.com), or does it indicate a problem with my code or machine? In either case, are there common workarounds/solutions?

    Have you tried going to that url in a browser? It has a message saying 'MeasureType is required' – Zac Faragher Apr 23 '17 at 22:40
  • The link should work in a browser - try [this](http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=) in case you're still interested. – chbonfield Apr 24 '17 at 11:57

1 Answer


If you visit the link in your browser, you'll notice it works fine. The reason is that the browser and `requests` send different User-Agent headers, and the site deliberately blocks HTTP requests that don't look like they come from a browser, because it doesn't want to be scraped. You can work around this like so:

response = requests.get(address, headers={
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0',
})

Keep this in mind and don't overload their servers.
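
For reference, here is how the complete request might look with that header in place. This is a minimal sketch, not part of the original answer: passing the query string as a `params` dict and adding a `timeout` are conveniences I'm assuming, not requirements of the fix.

import requests

# Same endpoint and query parameters as the question, expressed as a dict
# so requests handles the URL encoding ('Regular Season' becomes
# 'Regular+Season' automatically).
url = 'http://stats.nba.com/stats/leaguedashplayerstats'
params = {
    'College': '', 'Conference': '', 'Country': '', 'DateFrom': '',
    'DateTo': '', 'Division': '', 'DraftPick': '', 'DraftYear': '',
    'GameScope': '', 'GameSegment': '', 'Height': '', 'LastNGames': '0',
    'LeagueID': '00', 'Location': '', 'MeasureType': 'Base', 'Month': '0',
    'OpponentTeamID': '0', 'Outcome': '', 'PORound': '0', 'PaceAdjust': 'N',
    'PerMode': 'PerGame', 'Period': '0', 'PlayerExperience': '',
    'PlayerPosition': '', 'PlusMinus': 'N', 'Rank': 'N', 'Season': '2016-17',
    'SeasonSegment': '', 'SeasonType': 'Regular Season',
    'ShotClockRange': '', 'StarterBench': '', 'TeamID': '0',
    'VsConference': '', 'VsDivision': '', 'Weight': '',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) '
                  'Gecko/20100101 Firefox/46.0',
}

# The timeout makes the call fail fast instead of hanging indefinitely,
# which is the symptom described in the question.
response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()
data = response.json()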

Alex Hall
  • Thanks for the code/discussion - that makes a lot of sense. Is there a way to tell which sites may require additional information in `requests` (such as a `User-Agent` or some other header), or is it better practice to just provide it regardless? – chbonfield Apr 24 '17 at 11:56
  • @chbonfield trial and error. The more resources and motivation a site has to stop people from scraping, the more checks there will be in place, and it's not always as simple as extra information in the request. For example, sites will generally suspect a bot if requests are made too quickly. And eventually sites may require a captcha. – Alex Hall Apr 24 '17 at 12:02
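
Building on that last comment, below is a hypothetical sketch of pacing requests when looping over all of the measure types commented out in the question. The 2-second delay and the `Session` reuse are my assumptions about reasonable politeness, not documented limits of the site.

import time
import requests

# url_front and url_back are the two query-string halves defined in the
# question above; the browser User-Agent is the fix from the answer.
BROWSER_UA = ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) '
              'Gecko/20100101 Firefox/46.0')
measure_types = ['Base', 'Advanced', 'Misc', 'Scoring',
                 'Opponent', 'Usage', 'Defense']

session = requests.Session()            # reuse one connection across calls
session.headers['User-Agent'] = BROWSER_UA

results = {}
for measure_type in measure_types:
    response = session.get(url_front + measure_type + url_back, timeout=10)
    response.raise_for_status()
    results[measure_type] = response.json()
    time.sleep(2)                       # arbitrary pause between requests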