I have a piece of code that parses a JSON-like object embedded in the source code of this page:
The code being used to do this is:
import json
import re

import requests
from bs4 import BeautifulSoup

url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
data_cen = re.compile(r'var matchCentreData = ({.*?})')
event_type = re.compile(r'var matchCentreEventTypeJson = ({.*?})')
data = soup.find("a", href="/ContactUs").find_next("script").text
d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))
data_dict = json.loads(d)
event_dict = json.loads(e)
print(event_dict)
print(data_dict)
When I run this on a Windows platform, whether in Python 2.7 IDLE, the command shell, or IPython, the whole object being parsed (it is extremely long) is not returned. The returned object seems to cut off about halfway down.
This causes an AttributeError ('NoneType' object has no attribute 'group') on the line e = json.dumps(event_type.search(data).group(1)), because matchCentreEventTypeJson sits right near the bottom of the data structure, past the point where anything is printed to the screen on my operating system. Another user has run this code on a Linux-based operating system and did not have the same problem.
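For reference, the failure mode itself can be reproduced without any network access at all (the strings below are dummy stand-ins for the real script text): if the page text is cut off before the variable the second regex looks for, re.search returns None and calling .group(1) on it raises exactly this error.

```python
import re

# Minimal stand-in for the page text: two JS variable assignments.
full_page = 'var matchCentreData = {"a": 1}; var matchCentreEventTypeJson = {"b": 2};'
truncated_page = full_page[:40]  # simulate a response cut off partway down

event_type = re.compile(r'var matchCentreEventTypeJson = ({.*?})')

# On the full text the pattern matches and group(1) is the JSON object...
print(event_type.search(full_page).group(1))   # {"b": 2}
# ...but on the truncated text search returns None, so calling .group(1)
# would raise AttributeError: 'NoneType' object has no attribute 'group'
print(event_type.search(truncated_page))       # None
```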
Is anyone aware of any Windows-specific issues around the maximum length of returned strings? Or can anyone think of any other issues that could cause this to happen on Windows but not on Linux?
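One way to check whether the string itself is truncated, as opposed to just the console display (consoles can clip or scroll long output even when the underlying Python string is complete), is to measure the string directly and dump it to a file. The sketch below uses a dummy string in place of the real script text:

```python
# Check the string itself rather than trusting the console: long output can
# be clipped by the terminal even when the string is complete in memory.
data = "x" * 100000  # dummy stand-in for the long script text

print(len(data))                           # actual length of the string
print("matchCentreEventTypeJson" in data)  # membership test beats eyeballing

# Dump it to a file so the full text can be inspected in an editor.
with open("page_dump.txt", "w") as f:
    f.write(data)
```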
Thanks