I am attempting to make an independent copy of an instance of my URL
class in Python so I can modify the copy without affecting the original.
The following is a condensed, executable version of my problem code:
from bs4 import BeautifulSoup
from copy import deepcopy
from urllib import request

url_dict = {}

class URL:
    def __init__(self, url, depth, log_entry=None, soup=None):
        self.url = url
        self.depth = depth  # Current, not total, depth level
        self.log_entry = log_entry
        self.soup = soup
        self.indent = ' ' * (5 - self.depth)
        self.log_url = 'test.com'
        # Blank squad
        self.parsed_list = []

    def get_log_output(self):
        return self.indent + self.log_url

    def get_print_output(self):
        if self.log_entry is not None:
            return self.indent + self.log_url + ' | ' + self.log_entry
        return self.indent + self.log_url

    def set_soup(self):
        if self.soup is None:
            code = ''
            try:  # Read and store code for parsing
                code = request.urlopen(self.url).read()
            except Exception as exception:
                print(str(exception))
            self.soup = BeautifulSoup(code, features='lxml')

def crawl(current_url, current_depth):
    current_check_link = current_url
    has_crawled = current_check_link in url_dict
    if current_depth > 0 and not has_crawled:
        current_crawl_job = URL(current_url, current_depth)
        current_crawl_job.set_soup()
        url_dict[current_check_link] = deepcopy(current_crawl_job)

for link in ['http://xts.site.nfoservers.com']:  # Crawl for each URL the user inputs
    crawl(link, 3)
The resulting exception:
Traceback (most recent call last):
File "/home/[CENSORED]/.vscode-oss/extensions/ms-python.python-2020.10.332292344/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_trace_dispatch_regular.py", line 374, in __call__
if cache_skips.get(frame_cache_key) == 1:
RecursionError: maximum recursion depth exceeded in comparison
Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow.
Python runtime state: initialized
I am unable to tell where this specific infinite recursion is occurring. I've read through questions such as RecursionError when python copy.deepcopy, but I'm not even sure it applies to my use case. If it does apply, then my brain just can't seem to understand it, as I'm under the impression that deepcopy() should just take each self attribute's value and duplicate it onto the new instance. If that's not the case, then I would love some enlightenment. All the articles in my search results are similar to that question and aren't very helpful for my situation.
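For intuition: deepcopy() doesn't copy attribute values one level deep; it recursively copies every object reachable from them. A parse tree like a BeautifulSoup soup links every element back to its parent (and to siblings), so deep-copying one node recurses once per linked ancestor. A minimal stdlib stand-in, assuming a parent-linked structure (the Node class here is illustrative, not bs4's internals):

```python
from copy import deepcopy

# Toy stand-in for a parse tree: each node keeps a reference to its
# parent, roughly the way BeautifulSoup tags and strings do.
class Node:
    def __init__(self, parent=None):
        self.parent = parent

# Build a chain far deeper than the default recursion limit (~1000).
root = Node()
tip = root
for _ in range(5000):
    tip = Node(parent=tip)

try:
    deepcopy(tip)  # walks parent -> parent -> ... one frame set per level
    raised = False
except RecursionError:
    raised = True

print(raised)  # True
```

So the recursion depth of deepcopy() scales with how deeply linked the object graph is, not with how many attributes the top-level instance has.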
Please note, I'm not simply looking for a modified snippet of my code to fix this. I'd mainly like to understand what exactly is going on here so I can both fix it now and avoid it in the future.
Edit: It seems to be a clash between deepcopy and the set_soup() method. If I replace

url_dict[current_check_link] = deepcopy(current_crawl_job)

with

url_dict[current_check_link] = current_crawl_job

the snippet above runs without error. Likewise, if I completely remove the current_crawl_job.set_soup() call, I get no errors either. I just can't have both.
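For completeness, one possible way to keep both calls (a sketch, not necessarily the right fix for your crawler) is to give URL a `__deepcopy__` hook that copies every attribute normally but skips the soup, since set_soup() can rebuild it on demand. The stripped-down class below only illustrates the hook; the attribute names follow the snippet above:

```python
from copy import deepcopy

class URL:
    def __init__(self, url, depth, soup=None):
        self.url = url
        self.depth = depth
        self.soup = soup  # potentially a huge, deeply linked parse tree

    def __deepcopy__(self, memo):
        # Called by deepcopy() instead of its default recursive walk.
        cls = self.__class__
        clone = cls.__new__(cls)
        memo[id(self)] = clone  # guard against cyclic references
        for key, value in self.__dict__.items():
            if key == 'soup':
                clone.soup = None  # drop the tree; set_soup() can rebuild it
            else:
                setattr(clone, key, deepcopy(value, memo))
        return clone

original = URL('http://example.com', 3, soup=object())
clone = deepcopy(original)
print(clone.url, clone.soup)  # http://example.com None
```

This keeps the copy independent for everything except the parse tree, which is the one attribute deepcopy can't traverse cheaply.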
Edit2: I can remove either

try:  # Read and store code for parsing
    code = request.urlopen(self.url).read()
except Exception as exception:
    print(str(exception))

or

self.soup = BeautifulSoup(code, features='lxml')

and the error disappears yet again, with the program running normally.