I am attempting to make an independent copy of an instance of my URL class in Python, so I can modify the copy without affecting the original.

The following is a condensed, executable version of my problem code:

from bs4 import BeautifulSoup
from copy import deepcopy
from urllib import request

url_dict = {}


class URL:
    def __init__(self, url, depth, log_entry=None, soup=None):
        self.url = url
        self.depth = depth  # Current, not total, depth level
        self.log_entry = log_entry
        self.soup = soup
        self.indent = '    ' * (5 - self.depth)
        self.log_url = 'test.com'

        # Blank squad
        self.parsed_list = []

    def get_log_output(self):
        return self.indent + self.log_url

    def get_print_output(self):
        if self.log_entry is not None:
            return self.indent + self.log_url + ' | ' + self.log_entry

        return self.indent + self.log_url

    def set_soup(self):
        if self.soup is None:
            code = ''

            try:  # Read and store code for parsing
                code = request.urlopen(self.url).read()
            except Exception as exception:
                print(str(exception))

            self.soup = BeautifulSoup(code, features='lxml')


def crawl(current_url, current_depth):
    current_check_link = current_url
    has_crawled = current_check_link in url_dict
    
    if current_depth > 0 and not has_crawled:
        current_crawl_job = URL(current_url, current_depth)
        current_crawl_job.set_soup()
        url_dict[current_check_link] = deepcopy(current_crawl_job)


for link in ['http://xts.site.nfoservers.com']:  # Crawl for each URL the user inputs
    crawl(link, 3)

The resulting exception:

Traceback (most recent call last):
  File "/home/[CENSORED]/.vscode-oss/extensions/ms-python.python-2020.10.332292344/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_trace_dispatch_regular.py", line 374, in __call__
    if cache_skips.get(frame_cache_key) == 1:
RecursionError: maximum recursion depth exceeded in comparison
Fatal Python error: _Py_CheckRecursiveCall: Cannot recover from stack overflow.
Python runtime state: initialized

I am unable to tell where this specific infinite recursion is occurring. I've read through questions such as "RecursionError when python copy.deepcopy", but I'm not even sure it applies to my use case. If it does apply, I can't quite follow it: I'm under the impression that deepcopy() should just take each attribute value from self and duplicate it onto the new object. If that's not the case, I would love some enlightenment. All the articles in my search results are similar to that question and aren't very helpful for my situation.

Please note, I'm not simply looking for a modified snippet of my code to fix this. I'd mainly like to understand what exactly is going on here so I can both fix it now and avoid it in the future.

Edit: It seems to be a clash between deepcopy and the set_soup() method. If I replace

url_dict[current_check_link] = deepcopy(current_crawl_job)

with

url_dict[current_check_link] = current_crawl_job

the snippet above runs without error. Likewise, if I completely remove current_crawl_job.set_soup(), I get no errors either. I just can't have both.

Edit2: I can remove either

try:  # Read and store code for parsing
    code = request.urlopen(self.url).read()
except Exception as exception:
    print(str(exception))

or

self.soup = BeautifulSoup(code, features='lxml')

and the error disappears yet again, with the program running normally.
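
If that's right, then deep copying just the soup object should reproduce the failure on its own. Here is a stripped-down sketch (same page, same lxml parser; this is inferred from the two removals above rather than tested separately):

from copy import deepcopy
from bs4 import BeautifulSoup
from urllib import request

# Fetch and parse the same page, then deepcopy only the soup object
code = request.urlopen('http://xts.site.nfoservers.com').read()
soup = BeautifulSoup(code, features='lxml')
duplicate = deepcopy(soup)  # presumably raises the same RecursionError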

z7r1k3
  • Post something that runs and reproduces the error when run. Also, it looks like you're running this code through an IDE, possibly with a debugger. Does the error still occur if you just run it directly through the command line? – user2357112 Apr 25 '21 at 23:14
  • @user2357112supportsMonica Yes, I get the same error running straight from CLI. And I'll work on a snippet to be run real quick. – z7r1k3 Apr 25 '21 at 23:17
  • It seems this is due to something else entirely, possibly in regards to my other recursion as I can't reproduce it in the smaller snippet. I'll keep working at it, but right off the bat it seems like it may have to do with the fact I'm already using recursion to search a tree. Once I have a functional problem snippet I'll update the question. I don't really expect people to debug my entire file for me, but figured I should provide at least *something* while I work on this. – z7r1k3 Apr 25 '21 at 23:35
  • You can use [`sys.settrace()`](https://docs.python.org/3/library/sys.html#sys.settrace) to trace calls in your code and determine where the recursion is taking place (a minimal sketch follows this comment thread). See [Tracing a Program As It Runs](https://pymotw.com/3/sys/tracing.html#sys-tracing). – martineau Apr 25 '21 at 23:40
  • Added a working code snippet that produces the same error. I'll also take a look at that @martineau thank you. – z7r1k3 Apr 25 '21 at 23:49
  • `deepcopy()` has an internal cache that it uses to detect self-referential (recursive) data-structures precisely to prevent this sort of thing from happening — yet somehow it doesn't appear to be working based on the error information. The cache is so it can tell it has already seen and copied the object. – martineau Apr 25 '21 at 23:59
  • @martineau It looks like it's specifically clashing with my `set_soup()` method. I added an edit to the question showing it. If it's just a bug in deepcopy then that would make sense. – z7r1k3 Apr 26 '21 at 00:08
  • Likewise, removing the single line `self.soup = BeautifulSoup(code, features='lxml')` removes the errors as well. I really have no idea what's going on at this point. – z7r1k3 Apr 26 '21 at 00:15
  • Actually it looks like your `crawl()` function is trying to do something similar to what `deepcopy()` does — so the way it's doing that seems likely to be where the problem lies. – martineau Apr 26 '21 at 00:19
  • @martineau While I'm still not entirely sure what the problem is exactly, it looks like replacing `url_dict[current_check_link] = deepcopy(current_crawl_job)` with `url_dict[current_check_link] = URL(current_url, current_depth)` is an adequate workaround. – z7r1k3 Apr 26 '21 at 00:35
  • My guess is the logic behind the line `has_crawled = current_check_link in url_dict` is invalid for some reason. – martineau Apr 26 '21 at 00:40
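
A minimal sketch of the tracing approach suggested above (the tracer function itself is illustrative, not part of the original code):

import sys

def tracer(frame, event, arg):
    # Print every function call; runaway recursion shows up as a flood
    # of identical call sites right before the RecursionError.
    if event == 'call':
        print(frame.f_code.co_name, frame.f_code.co_filename, frame.f_lineno)
    return tracer

sys.settrace(tracer)
# ... run the failing deepcopy here ...
sys.settrace(None)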

1 Answer

This article states:

Deep copy is a process in which the copying process occurs recursively. It means first constructing a new collection object and then recursively populating it with copies of the child objects found in the original.

So my understanding is:

from copy import deepcopy

A = [1, 2, [3, 4], 5]
B = deepcopy(A)  # makes one level of recursive calls to copy the inner list

C = [1, [2, [3, [4, [5, [6]]]]]]
D = deepcopy(C)  # recurses five levels deep, copying each nested inner list

My Best Guess

Python has a max recursion depth limit to prevent stack overflow.

You can find your current recursion limit using:

import sys
print(sys.getrecursionlimit())

In your case, you're trying to deep copy a class instance. The recursive calls needed to copy that instance's attributes, and their nested children, must be exceeding the recursion limit.
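
You can see this in isolation by deep copying a structure nested deeper than the limit allows (a small sketch; the variable names are arbitrary):

from copy import deepcopy
import sys

# Build a list nested deeper than the recursion limit
nested = []
node = nested
for _ in range(sys.getrecursionlimit()):
    node.append([])
    node = node[0]

deepcopy(nested)  # raises RecursionError: maximum recursion depth exceeded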

Possible Solution

You can tell Python to allow a higher recursion limit using:

import sys

limit = 2000
sys.setrecursionlimit(limit)

Or you may get fancy and increase that limit as your program progresses. More about this on this link.
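
One hypothetical way to do that is to retry the copy with a progressively higher limit (the helper name, step, and ceiling below are made up; note that raising the limit too far risks a genuine stack overflow):

import sys
from copy import deepcopy

def resilient_deepcopy(obj, step=1000, ceiling=100_000):
    # Retry deepcopy, bumping the recursion limit after each failure
    while True:
        try:
            return deepcopy(obj)
        except RecursionError:
            new_limit = sys.getrecursionlimit() + step
            if new_limit > ceiling:
                raise  # give up rather than risk crashing the interpreter
            sys.setrecursionlimit(new_limit)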

I am not 100% sure that increasing the limit will do the job, but I'm fairly sure that some child object of your instance has so many nested inner objects that it's making deepcopy go nuts!

Edit

Something tells me the line below is the culprit:

self.soup = BeautifulSoup(code, features='lxml')

When you call current_crawl_job.set_soup(), the instance's soup attribute goes from None to a complex BeautifulSoup object, and that deeply nested parse tree is what's giving deepcopy trouble.

Suggestion

In the set_soup method, keep self.soup as the raw HTML string and convert it to a BeautifulSoup object only when you need to work with it. This should solve your deepcopy problem.
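
A minimal sketch of that idea against the class from the question (set_html and get_soup are illustrative names, not part of the original code):

from bs4 import BeautifulSoup
from urllib import request


class URL:
    def __init__(self, url, depth):
        self.url = url
        self.depth = depth
        self.html = ''  # raw page source; plain bytes/str deepcopy safely

    def set_html(self):
        if not self.html:
            try:  # Read and store the page source for later parsing
                self.html = request.urlopen(self.url).read()
            except Exception as exception:
                print(str(exception))

    def get_soup(self):
        # Build the parse tree on demand instead of storing it on the instance
        return BeautifulSoup(self.html, features='lxml')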

Jay Shukla
  • You know what, I think you're right. Gonna put it to the test real quick, but I'm fairly certain now that it must be `self.soup`, as that contains a lot of information about the webpage, including the source code itself. I may just change it to a str inside the class, and handle the actual soup outside of it. We'll see how it goes. – z7r1k3 Apr 26 '21 at 00:53
  • Yeah, that's the way to go. I unknowingly just edited my answer saying the same thing lol! – Jay Shukla Apr 26 '21 at 00:56