
I am trying to parse the item called matchCentreData that can be found within the source code at the following page:

http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United

Because there are no XHR requests involved on this page and the data item is buried in the page source code itself, I am unsure of how to parse this item using anything other than a regex.

Because the data structure is deeply nested, I am trying to break it down into several sub-components and parse each one individually. Here is my code, which tries to parse only the first sub-component, playerIdNameDictionary:

import json
import simplejson
import requests
import jsonobject
import time
import re

url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'
params = {}

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}


responser = requests.get(url, params=params, headers=headers)

regex = re.compile("matchCentreData = \{.*?\};", re.S)
match = re.search(regex, responser.text)
match2 = match.group()

match3 = match2[u'playerIdNameDictionary']
print match3

This however produces the following error:

Traceback (most recent call last):
  File "C:\Python27\counter.py", line 23, in <module>
    match3 = match2[u'playerIdNameDictionary']
TypeError: string indices must be integers

I am presuming that this is because the item I am returning is a string, rather than a JSON object. What I want to know is:

1) Am I correct in my diagnosis of the problem, as stated in the sentence above?
2) How can I parse the JSON/JavaScript object matchCentreData without using a regex?

I hope my question makes sense.

Thanks

gdogg371
  • what exactly are you trying to get? – Padraic Cunningham Jan 04 '15 at 00:18
  • @PadraicCunningham hello again. most of this site seems to function on XHR requests, but annoyingly some pages dont and the JSON esque object is embedded in the source code of the page. in an instance such as this i dont know any other way of parsing the object, in this case called 'matchCentreData' than using a regex. i want to know how i can reference this object as a json/javascript item within the source code and then known how to reference sub components of 'matchCentreData'. The first sub component for example is called 'playerIdNameDictionary'. let me know if that doesnt make sense... – gdogg371 Jan 04 '15 at 00:24

2 Answers


match2 is just a string, not a JSON object. You can use match2 = json.loads(match2) to convert the string into a Python object (here, a dict). Wrap the json.loads call in a try/except block to catch errors in the source JSON.

More about json.loads(): https://docs.python.org/2/library/json.html
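To make the difference concrete, here is a minimal sketch. The `raw` string is a made-up stand-in for the text captured from the page, and `parse_blob` is a hypothetical helper name, not something from the question. It is written so that it should also run on Python 3, where json.JSONDecodeError is a subclass of ValueError.

```python
import json

# Hypothetical stand-in for the text captured from the page source.
raw = '{"playerIdNameDictionary": {"34693": "Marko Arnautovic"}}'

# Indexing the raw string with a key fails, as in the traceback above.
try:
    raw['playerIdNameDictionary']
    string_lookup_error = None
except TypeError as exc:
    string_lookup_error = exc


def parse_blob(text):
    """Return the decoded object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # also covers json.JSONDecodeError on Python 3
        return None


parsed = parse_blob(raw)
players = parsed['playerIdNameDictionary']

# Leaving the JavaScript assignment around the object makes decoding fail.
not_json = parse_blob('matchCentreData = {"a": 1};')
```

Note that the last call returns None because the captured text still contains the `matchCentreData = ` prefix and the trailing semicolon, which is one likely reason for a "No JSON object could be decoded" error.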


As I stated in the comments below, your regexp is a bit too loose. It'll start to match when it finds var matchCentreData = { ... but it'll continue to match until the very last json blob in response.text is finished. That's not something json.loads can handle. I've changed the code to this:

>>> regex = re.compile(r"var matchCentreData = (\{.+\});\r\n        var matchCentreEventTypeJson", re.S)
>>> match = re.search(regex, response.text)
>>> # match.group(1) now contains the match centre data JSON blob
>>> match_centre_data = json.loads(match.group(1))
>>> match_centre_data['playerIdNameDictionary']['34693']
'Marko Arnautovic'

Please note that this approach is very fragile and will likely break when whoscored.com updates their site.
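One way to depend less on the exact whitespace and on the name of the following variable is to let the JSON decoder decide where the object ends. The sketch below uses json.JSONDecoder.raw_decode, which parses a single JSON value starting at a given index. The `extract_js_object` helper and the miniature `page` string are illustrative assumptions, and the approach only works if the object literal in the page is strict JSON.

```python
import json
import re


def extract_js_object(source, var_name):
    """Decode the JSON object assigned to var_name inside source.

    Assumes the page contains `var_name = {...}` and that the object
    literal is strict JSON (double-quoted keys, no trailing commas).
    """
    assignment = re.search(r'\b' + re.escape(var_name) + r'\s*=\s*', source)
    if assignment is None:
        raise ValueError('%s not found' % var_name)
    # raw_decode parses one JSON value starting at the given index and
    # returns it together with the index at which the value ended.
    obj, _end = json.JSONDecoder().raw_decode(source, assignment.end())
    return obj


# Hypothetical miniature of the page source.
page = '''
<script>
    var matchCentreData = {"playerIdNameDictionary": {"34693": "Marko Arnautovic"}, "events": [{"x": 15.0}]};
    var matchCentreEventTypeJson =
        {"shotSixYardBox": 0, "pos": 217};
</script>
'''

match_centre_data = extract_js_object(page, 'matchCentreData')
event_types = extract_js_object(page, 'matchCentreEventTypeJson')
```

Because the decoder stops at the end of the first complete value, nested braces and a value placed on the next line are both handled without tuning the regex.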

Bjorn
  • hi there, thanks for replying. your response i have already tried. i get an error saying that 'ValueError: No JSON object could be decoded'. – gdogg371 Jan 04 '15 at 00:08
  • In that case, check what match.group() returns. I guess there's an error in your regular expression. Check that you're only capturing `var matchCentreData = -->{ ... }<--;\r\nvar matchId = ...` that part. The hard part is writing a regular expression that will stop matching when it finds another var with json data. – Bjorn Jan 04 '15 at 00:24

You can use BeautifulSoup to extract the script:

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
data = soup.find("script",text=data_cen).text
d = json.dumps(data_cen.search(data).group(1))
data_dict  = (json.loads(d))
{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}
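The output above opens two braces but closes only one, which is what a non-greedy `({.*?})` would be expected to produce on a nested object: the match stops at the first `}` it meets. A small sketch on made-up data (the `otherKey` entry is purely hypothetical) shows the effect:

```python
import json
import re

# Hypothetical miniature of the script text; "otherKey" is invented.
script_text = ('var matchCentreData = {"playerIdNameDictionary":'
               '{"34693":"Marko Arnautovic"},"otherKey":{"a":1}};')

pattern = re.compile('var matchCentreData = ({.*?})')
captured = pattern.search(script_text).group(1)

# The non-greedy match ends at the first closing brace, so the outer
# object is left unterminated.
balanced = captured.count('{') == captured.count('}')

try:
    json.loads(captured)
    decodes = True
except ValueError:
    decodes = False
```

Here `captured` is the inner dictionary plus the outer opening brace only, so it is not valid JSON on its own.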

You can also find the script using find_next and similar regex to extract the required data:

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
event_type = re.compile('var matchCentreEventTypeJson = ({.*?})')

data = soup.find("a", href="/ContactUs").find_next("script").text
d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))

data_dict = json.loads(d)
event_dict = json.loads(e)

{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}
{"shotSixYardBox":0,"shotPenaltyArea":1,"shotOboxTotal":2,"shotOpenPlay":3,"shotCounter":4,"shotSetPiece":5,"shotOffTarget":6,"shotOnPost":7,"shotOnTarget":8,"shotsTotal":9,"shotBlocked":10,"shotRightFoot":11,"shotLeftFoot":12,"shotHead":13,"shotObp":14,"goalSixYardBox":15,"goalPenaltyArea":16,"goalObox":17,"goalOpenPlay":18,"goalCounter":19,"goalSetPiece":20,"penaltyScored":21,"goalOwn":22,"goalNormal":23,"goalRightFoot":24,"goalLeftFoot":25,"goalHead":26,"goalObp":27,"shortPassInaccurate":28,"shortPassAccurate":29,"passCorner":30,"passCornerAccurate":31,"passCornerInaccurate":32,"passFreekick":33,"passBack":34,"passForward":35,"passLeft":36,"passRight":37,"keyPassLong":38,"keyPassShort":39,"keyPassCross":40,"keyPassCorner":41,"keyPassThroughball":42,"keyPassFreekick":43,"keyPassThrowin":44,"keyPassOther":45,"assistCross":46,"assistCorner":47,"assistThroughball":48,"assistFreekick":49,"assistThrowin":50,"assistOther":51,"dribbleLost":52,"dribbleWon":53,"challengeLost":54,"interceptionWon":55,"clearanceHead":56,"outfielderBlock":57,"passCrossBlockedDefensive":58,"outfielderBlockedPass":59,"offsideGiven":60,"offsideProvoked":61,"foulGiven":62,"foulCommitted":63,"yellowCard":64,"voidYellowCard":65,"secondYellow":66,"redCard":67,"turnover":68,"dispossessed":69,"saveLowLeft":70,"saveHighLeft":71,"saveLowCentre":72,"saveHighCentre":73,"saveLowRight":74,"saveHighRight":75,"saveHands":76,"saveFeet":77,"saveObp":78,"saveSixYardBox":79,"savePenaltyArea":80,"saveObox":81,"keeperDivingSave":82,"standingSave":83,"closeMissHigh":84,"closeMissHighLeft":85,"closeMissHighRight":86,"closeMissLeft":87,"closeMissRight":88,"shotOffTargetInsideBox":89,"touches":90,"assist":91,"ballRecovery":92,"clearanceEffective":93,"clearanceTotal":94,"clearanceOffTheLine":95,"dribbleLastman":96,"errorLeadsToGoal":97,"errorLeadsToShot":98,"intentionalAssist":99,"interceptionAll":100,"interceptionIntheBox":101,"keeperClaimHighLost":102,"keeperClaimHighWon":103,"keeperClaimLost":104,"keeperClaimWon":105,"keeperOneToOneWon":106,"parriedDanger":107,"parriedSafe":108,"collected":109,"keeperPenaltySaved":110,"keeperSaveInTheBox":111,"keeperSaveTotal":112,"keeperSmother":113,"keeperSweeperLost":114,"keeperMissed":115,"passAccurate":116,"passBackZoneInaccurate":117,"passForwardZoneAccurate":118,"passInaccurate":119,"passAccuracy":120,"cornerAwarded":121,"passKey":122,"passChipped":123,"passCrossAccurate":124,"passCrossInaccurate":125,"passLongBallAccurate":126,"passLongBallInaccurate":127,"passThroughBallAccurate":128,"passThroughBallInaccurate":129,"passThroughBallInacurate":130,"passFreekickAccurate":131,"passFreekickInaccurate":132,"penaltyConceded":133,"penaltyMissed":134,"penaltyWon":135,"passRightFoot":136,"passLeftFoot":137,"passHead":138,"sixYardBlock":139,"tackleLastMan":140,"tackleLost":141,"tackleWon":142,"cleanSheetGK":143,"cleanSheetDL":144,"cleanSheetDC":145,"cleanSheetDR":146,"cleanSheetDML":147,"cleanSheetDMC":148,"cleanSheetDMR":149,"cleanSheetML":150,"cleanSheetMC":151,"cleanSheetMR":152,"cleanSheetAML":153,"cleanSheetAMC":154,"cleanSheetAMR":155,"cleanSheetFWL":156,"cleanSheetFW":157,"cleanSheetFWR":158,"cleanSheetSub":159,"goalConcededByTeamGK":160,"goalConcededByTeamDL":161,"goalConcededByTeamDC":162,"goalConcededByTeamDR":163,"goalConcededByTeamDML":164,"goalConcededByTeamDMC":165,"goalConcededByTeamDMR":166,"goalConcededByTeamML":167,"goalConcededByTeamMC":168,"goalConcededByTeamMR":169,"goalConcededByTeamAML":170,"goalConcededByTeamAMC":171,"goalConcededByTeamAMR":172,"goalConcededByTeamFWL":173,"goalConcededByTeamFW":174,"goalConcededByTeamFWR":175,"goalConcededByTeamSub":176,"goalConcededOutsideBoxGoalkeeper":177,"goalScoredByTeamGK":178,"goalScoredByTeamDL":179,"goalScoredByTeamDC":180,"goalScoredByTeamDR":181,"goalScoredByTeamDML":182,"goalScoredByTeamDMC":183,"goalScoredByTeamDMR":184,"goalScoredByTeamML":185,"goalScoredByTeamMC":186,"goalScoredByTeamMR":187,"goalScoredByTeamAML":188,"goalScoredByTeamAMC":189,"goalScoredByTeamAMR":190,"goalScoredByTeamFWL":191,"goalScoredByTeamFW":192,"goalScoredByTeamFWR":193,"goalScoredByTeamSub":194,"aerialSuccess":195,"duelAerialWon":196,"duelAerialLost":197,"offensiveDuel":198,"defensiveDuel":199,"bigChanceMissed":200,"bigChanceScored":201,"bigChanceCreated":202,"overrun":203,"successfulFinalThirdPasses":204,"punches":205,"penaltyShootoutScored":206,"penaltyShootoutMissedOffTarget":207,"penaltyShootoutSaved":208,"penaltyShootoutSavedGK":209,"penaltyShootoutConcededGK":210,"throwIn":211,"subOn":212,"subOff":213,"defensiveThird":214,"midThird":215,"finalThird":216,"pos":217}
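One detail in the snippet above is worth checking before relying on `data_dict` and `event_dict` as dictionaries: json.dumps applied to a str produces a JSON string literal, and json.loads on that literal returns the original str. The sketch below uses a shortened, made-up `captured` value to show the difference between the round trip and decoding the text directly.

```python
import json

# Shortened, made-up stand-in for the text captured by the regex.
captured = '{"shotSixYardBox":0,"pos":217}'

# dumps() quotes the whole text as one JSON string; loads() undoes that,
# so the result is the same str that went in.
round_tripped = json.loads(json.dumps(captured))

# Decoding the captured text directly yields a dict.
decoded = json.loads(captured)
```

If key lookups such as `event_dict['pos']` are needed, decoding the captured text directly (as in the last line) is probably what is wanted, provided the capture is a complete JSON object.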

Full code:

import json
import requests
import re

url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}


r = requests.get(url,  headers=headers)


from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
event_type = re.compile('var matchCentreEventTypeJson = ({.*?})')

data = soup.find("a", href="/ContactUs").find_next("script").text
d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))

data_dict = json.loads(d)
event_dict = json.loads(e)
print(event_dict)
print(data_dict)

Padraic Cunningham
  • in the example above i get an error that 'r' is not defined. is that meant to be something else? in your second example, i am confused as to what the line starting 'data = ' is doing? – gdogg371 Jan 04 '15 at 01:30
  • data is finding the next script tag after the a tag that contains the contact us info. If you look at the source you will see it is just before the script you want. – Padraic Cunningham Jan 04 '15 at 01:40
  • ok, this is nearly there..the second regex, 'event_type' is not finding a match though. does it need a '\n' adding as the data object is on a new line from 'matchCentreEventTypeJson ='? thanks. – gdogg371 Jan 04 '15 at 01:52
  • I added the full code I am using, it returns the two dicts you see – Padraic Cunningham Jan 04 '15 at 02:02
  • Traceback (most recent call last): File "C:\Python27\counter.py", line 23, in e = json.dumps(event_type.search(data).group(1)) AttributeError: 'NoneType' object has no attribute 'group' – gdogg371 Jan 04 '15 at 02:10
  • ok, i can see why there is a nonetype being returned. 'data' seems to randomly stop at the line '"playerId":118244,"x":15.0,"y":89.0', which is about half way down the first object being parsed. – gdogg371 Jan 04 '15 at 02:24
  • again that is weird, what happens using `data = soup.find("script",text=data_cen).text` – Padraic Cunningham Jan 04 '15 at 02:26
  • same result again unfortunately...ive tried running the script in both python IDLE and command shell to make sure that it isnt some sort of issue around how long the variable data is...i could try running it ipython i suppose and see what happens then – gdogg371 Jan 04 '15 at 02:31
  • ...hmmm, im wondering if it is a variable length issue of some sort in windows...can you suggest any way of testing that theory? – gdogg371 Jan 04 '15 at 02:36
  • ok, thanks. i will have another look at this tomorrow and see if i can think of a way around it. – gdogg371 Jan 04 '15 at 02:50
  • I will have another look myself too. Too late for full brain activity here! – Padraic Cunningham Jan 04 '15 at 02:54
  • @gdogg371, what are you running this on? Also what version of requests are you using? – Padraic Cunningham Jan 04 '15 at 18:29
  • hi padraic, have a look at this thread...it seems the issue is around lxml and/or libxml2...when i tried using a different parsing method i did not have an issue. http://stackoverflow.com/questions/27766087/maximum-python-json-object-length-in-windows?noredirect=1#comment43946398_27766087 – gdogg371 Jan 04 '15 at 20:39
  • @gdogg371, I had actually tried using different parsers including lxml which is the default for me and they all worked so I did not think that was the issue. You must have an old version installed. Did you try upgrading? – Padraic Cunningham Jan 04 '15 at 21:27
  • i'm going to do that in a little while. i think the version i have was auto installed during the scrapy install. do you know which module on pypi i should be upgrading/updating? – gdogg371 Jan 04 '15 at 22:00
  • what happens if you use `lxml` as the parser? – Padraic Cunningham Jan 04 '15 at 22:02
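Since the comments point at the HTML parser as the likely cause of the truncated script text, one way to test that theory is to pull the script bodies out with a different parser and compare lengths. The sketch below swaps in the standard library's html.parser instead of BeautifulSoup with lxml. It is written for Python 3, and the `ScriptCollector` class and the tiny `page` string are illustrative assumptions. BeautifulSoup also accepts a parser name as its second argument, for example BeautifulSoup(r.content, "html.parser").

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append each one.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical miniature of the page.
page = ('<html><body><a href="/ContactUs">Contact</a>'
        '<script>var matchCentreData = {"a": 1};</script></body></html>')

collector = ScriptCollector()
collector.feed(page)
collector.close()

# Pick the script that mentions the variable of interest.
target = next(s for s in collector.scripts if 'matchCentreData' in s)
```

Comparing `len(target)` with the length of the text returned by the lxml-backed soup would show whether the script is being cut short by the parser rather than by the shell or the IDE.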