I'm working on a web scraping component using BeautifulSoup and RoboBrowser, and have run into a curious problem with one case in particular. The page in question has all the same chrome and structure as the other cases that work fine, but its main data field (a neatly labelled div) is one huge line (around 3000 characters of Japanese text) with no linebreaks. It's peppered with a LOT of BR tags (they're using them in a rather gruesome way to format tables...) and a few SPAN tags for formatting, but the whole body text sits on a single line.
That doesn't seem like it should be a problem, but my scraper dies with a RecursionError: maximum recursion depth exceeded in comparison, after spitting out several hundred (possibly thousands of) identical pairs of these lines:
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
I was originally blaming BeautifulSoup and thought the sheer number of BR tags was throwing it off, but the problem actually seems to be in the Unicode handling. Here's the code that's triggering it:
  File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
    self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('<br/>', '\n'))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
    return self.decode()
I thought it might be the line length, which is why I'm parsing the DIV block child-by-child instead of doing the whole thing at once, but it hasn't helped in the slightest. No matter how small the chunks, calling str() on a BeautifulSoup object seems to drive the unicode parser into an insane frenzy.
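For reference, here's roughly what that line 207 is doing, pulled out into a standalone helper (simplified; the fragment argument stands in for str(self._one_child)):

```python
import re

def strip_tags(fragment):
    # Same transform as line 207: turn <br/> into newlines first,
    # then strip every remaining tag with a non-greedy regex.
    return re.sub('<[^<]+?>', '', fragment.replace('<br/>', '\n'))

print(strip_tags('<span>text</span><br/>more'))
```

The crash happens before this helper ever gets the string, though: it's the str() call on the child that recurses.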
To thicken the plot slightly: I copied the entire text of the page source into a new Python sandbox as a long string, so I could test different code against it without constantly logging into the website. Python promptly refused to compile the code (complaining that it contained non-UTF-8 characters), even after I ran the text through vi and forced it to save as UTF-8. However, inserting newlines into the text to divide it into smaller chunks stopped the error from appearing, despite not changing or deleting a single character of the text itself, at which point the script compiled and scraped the page perfectly.
I have no idea how to proceed from here, and I don't control the site I'm scraping. I thought about forcing newlines into the response object in RoboBrowser before BeautifulSoup touches it, which is a horrible hack but seems like it might fix things, though I'm not sure how to go about it. Can anyone suggest another approach?
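To be concrete about the hack I have in mind (untested, and I'm guessing at RoboBrowser's internals here; that browser.response holds the underlying requests Response is an assumption on my part):

```python
def insert_linebreaks(raw):
    # Add a newline after every <br/> so no single line in the document
    # grows to thousands of characters before BeautifulSoup sees it.
    return raw.replace(b'<br/>', b'<br/>\n')

# Hypothetical wiring (overwriting the Response's private _content
# attribute is unsupported, hence "horrible hack"):
# browser.response._content = insert_linebreaks(browser.response.content)
```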
(Unfortunately I cannot link to the page I'm scraping data from as it's a research data supplier that requires a login and doesn't have permanent URLs for individual pieces of data.)
Edit: Adding full stacktrace below...
Traceback (most recent call last):
  File "scrape.py", line 112, in <module>
    dataScrape()
  File "scrape.py", line 39, in dataScrape
    for article in scraper.articles():
  File "/Users/myself/Projects/Scraper/scrape.py", line 207, in articles
    self._childtext = re.sub('<[^<]+?>', '', str(self._one_child).replace('<br/>', '\n'))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1039, in __unicode__
    return self.decode()
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1126, in decode
    indent_contents, eventual_encoding, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1195, in decode_contents
    formatter))
#
# These lines repeat identically several hundred times, then...
#
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 1192, in decode_contents
    text = c.output_ready(formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 716, in output_ready
    output = self.format_string(self, formatter)
  File "/usr/local/lib/python3.5/site-packages/bs4/element.py", line 158, in format_string
    if not isinstance(formatter, collections.Callable):
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison