
I have a piece of code that parses a JSON-like object embedded in the source code of this page:

http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United

The code being used to do this is:

import json
import re

import requests
from bs4 import BeautifulSoup

url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}

r = requests.get(url, headers=headers)

# No parser specified, so BeautifulSoup picks the best one available (lxml if installed)
soup = BeautifulSoup(r.content)

# The two JavaScript variables embedded in the page source that hold the match data
data_cen = re.compile('var matchCentreData = ({.*?})')
event_type = re.compile('var matchCentreEventTypeJson = ({.*?})')

# The <script> block containing both variables sits just after the /ContactUs link
data = soup.find("a", href="/ContactUs").find_next("script").text

d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))

data_dict = json.loads(d)
event_dict = json.loads(e)
print(event_dict)
print(data_dict)

When I run this on a Windows platform, whether in Python 2.7 IDLE, a command shell or IPython, the object being parsed (which is extremely long) is not returned in full: it appears to be cut off about halfway through.

This causes a NoneType error on the line `e = json.dumps(event_type.search(data).group(1))`, because matchCentreEventTypeJson sits near the bottom of the data structure and is never printed to screen on my machine. Another user has run this code on a Linux-based operating system and did not have the same problem.

Is anyone aware of any Windows-specific issues around the maximum length of returned strings? Or can anyone think of anything else that could be causing this to happen on Windows but not on Linux?
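
One way to narrow this down is to check whether the truncation happens at the HTTP level or only after parsing, by running the same regexes against the raw response text as well as against the script tag extracted by BeautifulSoup. A minimal sketch, reusing `r`, `soup`, `data_cen` and `event_type` from the code above (the `raw` and `script` names are just illustrative):

# Check whether the truncation happens in the HTTP response itself or only after parsing
raw = r.text  # the full page exactly as requests received it

print(len(r.content))                        # how many bytes actually arrived
print(data_cen.search(raw) is not None)      # is matchCentreData present in the raw HTML?
print(event_type.search(raw) is not None)    # is matchCentreEventTypeJson present in the raw HTML?

script = soup.find("a", href="/ContactUs").find_next("script").text
print(len(script))                           # how much of it survived the BeautifulSoup pass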

Thanks

gdogg371
  • There are no platform-related limits on strings, no. Something *else* is wrong. What does `len(data)` produce? – Martijn Pieters Jan 04 '15 at 13:41
  • A far more likely issue is with the HTML page; if it contains *broken* (malformed) HTML, then how this is handled depends on the parser used. If you have `lxml` installed on one machine but not on another, you'll be using *different parsers* (`lxml` is the default if available). – Martijn Pieters Jan 04 '15 at 13:44
  • Another option is that your Windows machine is being served different content based on IP address (geolocation). You'll need to figure out how the *HTML source* differs in that case. Perhaps [`difflib.ndiff()`](https://docs.python.org/2/library/difflib.html#difflib.ndiff) can help there; feed it *two lists of lines* and it'll spit out annotated differences between the two. – Martijn Pieters Jan 04 '15 at 13:50
  • @MartijnPieters Hi there, thanks for replying. To deal with your points in turn: 1) the length alternates between 120423 and 120539, but even the longer one is still not the complete object, and I am unsure why. 2) I have lxml installed on this machine, and the other user has Scrapy installed, which has lxml as a dependency, so I'm assuming they do too. 3) I am in Manchester, England; the other user is in the Republic of Ireland. I am not sure why different locations would change the content returned? – gdogg371 Jan 04 '15 at 14:03
  • @MartijnPieters The full data structure is in the source code of the webpage I am looking at. If I use a basic regex I can return the entire object, but I can't then parse sub-components of it (which is the ultimate aim) without using another series of regexes, because the object returned is a string rather than a JSON object. – gdogg371 Jan 04 '15 at 14:08
  • Did the response arrive intact? E.g. does `len(r.content)` match `r.headers['content-length']`? Do these match the Linux results? – Martijn Pieters Jan 04 '15 at 14:31
  • Also check the versions of lxml and libxml2; I've seen issues with certain libxml2 versions in the past. – Martijn Pieters Jan 04 '15 at 14:44
  • Is the statement `r.headers['content-length']` a literal, or do I need to replace 'content-length' with something? That line of code throws up the error: `Traceback (most recent call last): File "C:\Python27\counter.py", line 24, in print r.headers['content-length'] File "C:\Python27\lib\site-packages\requests\structures.py", line 77, in __getitem__ return self._store[key.lower()][1] KeyError: 'content-length'` – gdogg371 Jan 04 '15 at 14:46
  • Interesting, the response is chunked and has no preset content length. Does `len(r.content)` match what the Linux response receives? I see 1106584 characters on my end. – Martijn Pieters Jan 04 '15 at 15:08
  • @MartijnPieters Unfortunately the other user is not online at the minute... The length you are returning will be for the total source code of the page, I imagine. I am only after the objects called 'matchCentreData' and 'matchCentreEventTypeJson'. The item 'data' is both of these objects matched within the code using BeautifulSoup, and the length given earlier was in relation to this rather than the full source code of the page. Thanks. – gdogg371 Jan 04 '15 at 15:15
  • Yes, `r` is the response object, `r.content` the full HTML text to be parsed by BeautifulSoup. If it is shorter on Windows, then you didn't receive a complete response, and you need to start looking at either `requests` (could be you have a very old version with bugs) or at your network (a proxy server that breaks the response). – Martijn Pieters Jan 04 '15 at 15:20
  • @MartijnPieters I'm not on a proxy server here, just a home internet connection. When I print 'response.text' I can see that I am getting the complete source code for the full page printed to screen, which is a lot longer than the BeautifulSoup object. I think the issue is around this, to be honest... Are there any other modules for parsing JSON embedded in a webpage's source code? – gdogg371 Jan 04 '15 at 15:23
  • If the full page loads, you have a libxml2 problem. Verify it by using a different parser; use `BeautifulSoup(r.content, 'html.parser')` to verify. – Martijn Pieters Jan 04 '15 at 15:58
  • @MartijnPieters That has worked correctly now. Do you know which versions of libxml2 there are issues with, as you mentioned earlier? I will check which version I have. When I did the Scrapy install on this machine it auto-installed libxml2, I think. In a previous install of Scrapy I'm sure I had to install that dependency manually... – gdogg371 Jan 04 '15 at 16:16
  • I'd just install the latest version. – Martijn Pieters Jan 04 '15 at 16:54
  • @MartijnPieters I've tried installing libxml2 and libxml2-python via pip and neither is found. Is it contained within a different module? – gdogg371 Jan 04 '15 at 18:13
  • `libxml2` is not a Python module. It is a [C library](http://xmlsoft.org/), with [windows binaries](ftp://ftp.zlatkovic.com/libxml/) available. – Martijn Pieters Jan 04 '15 at 18:33
  • It *could* be that libxml2 is statically linked into lxml, so it could be that upgrading lxml would be enough. – Martijn Pieters Jan 04 '15 at 19:11
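
The geolocation theory raised in the comments can be tested the way suggested there: save the HTML received on each machine and diff the two. A tiny sketch using `difflib.ndiff()`, assuming the two saved pages are named windows.html and linux.html (hypothetical filenames):

import difflib

# Compare the page source received on the two machines line by line
with open('windows.html') as f:
    win_lines = f.readlines()
with open('linux.html') as f:
    lin_lines = f.readlines()

for line in difflib.ndiff(win_lines, lin_lines):
    if line.startswith(('+ ', '- ')):   # only show lines that actually differ
        print(line)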
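
Pulling together the other checks suggested in the comment thread, a rough diagnostic sketch along these lines should show whether the lxml-backed parser is the culprit. It reuses `r`, `data_cen` and `event_type` from the question; the version attributes are the standard `lxml.etree` constants:

import bs4
import lxml.etree
from bs4 import BeautifulSoup

# Versions involved; certain libxml2 builds have been reported to mishandle long documents
print('bs4 %s, lxml %s, libxml2 compiled %s / running %s' % (
    bs4.__version__, lxml.etree.LXML_VERSION,
    lxml.etree.LIBXML_COMPILED_VERSION, lxml.etree.LIBXML_VERSION))

# Parse the same response with both parsers and compare what each one extracts
for parser in ('lxml', 'html.parser'):
    soup = BeautifulSoup(r.content, parser)
    script = soup.find("a", href="/ContactUs").find_next("script").text
    print('%s: %d characters, matchCentreData found: %s, eventTypeJson found: %s' % (
        parser, len(script), data_cen.search(script) is not None,
        event_type.search(script) is not None))

If the 'html.parser' pass finds both variables while the 'lxml' pass does not, upgrading lxml (which may bundle or statically link libxml2) is the next step, as suggested in the final comments.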
