I am stuck on a frustrating problem in the following code and have not been able to solve it after two days.

This is the simplified code:

def crawl_web(url, depth):
    toCrawl = [url]
    crawled = ['https://index.html']
    i = 0
    while i <= depth:
        interim = []
        for x in toCrawl:
            if x not in toCrawl and x not in crawled and x not in interim:
                print("NOT IN")
            crawled.append(x)
        toCrawl = interim
        i += 1
    return crawled

print(crawl_web("https://index.html", 1))

I expect the output to be just:

['https://index.html']

But the `not in` check does not seem to work, and I keep getting this output:

['https://index.html', 'https://index.html']
Peter Mortensen
John Jam
  • You append to `crawled` *regardless*. Did you mean to indent `crawled.append()` to be part of the `if` statement? – Martijn Pieters Oct 29 '15 at 09:43
  • It performs the loop twice for `depth=1`. `i <= 1` is `True` when `i == 0` and when `i == 1`. – Peter Wood Oct 29 '15 at 09:44
  • Actually, I realize I wrote the simplified code incorrectly. Should I rewrite it here or post another question with the corrected simplified code? – John Jam Oct 29 '15 at 09:48
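
Peter Wood's point about the loop bound can be checked in isolation. The sketch below is not part of the original code; `count_passes` is a hypothetical helper that only counts how often the body of a `while i <= depth` loop runs.

```python
def count_passes(depth):
    """Count how many times the body of `while i <= depth` executes."""
    i = 0
    passes = 0
    while i <= depth:
        passes += 1
        i += 1
    return passes


print(count_passes(0))  # 1
print(count_passes(1))  # 2
```

So `depth=1` gives two passes. If exactly `depth` passes are wanted, `while i < depth` would be the condition to use.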

1 Answer

`crawled.append(x)` runs on every iteration regardless of the `if` result, because it sits at the same indentation level as the `if` statement. Move it inside the `if` block:

def crawl_web(url, depth):
    toCrawl = [url]
    crawled = ['https://index.html']
    i = 0
    while i <= depth:
        interim = []
        for x in toCrawl:
            if x not in toCrawl and x not in crawled and x not in interim:
                print("NOT IN")
                crawled.append(x)
        toCrawl = interim
        i += 1
    return crawled

print(crawl_web("https://index.html", 1))
ojii
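
One thing to note about the simplified code: `x` is taken from `toCrawl`, so `x not in toCrawl` can never be true, and with the append moved inside the `if` the function just returns the seeded list unchanged. Below is a minimal sketch of one way to record each URL only once. It rests on assumptions that are not in the question: the `links` dictionary is a made-up stand-in for real page fetching, `crawled` starts empty, and a `seen` set tracks what has already been queued.

```python
def crawl_web(url, depth, links):
    """Breadth-first crawl that records each URL at most once.

    `links` is a hypothetical dict mapping a URL to the URLs it links to;
    it stands in for real page fetching, which the simplified code omits.
    """
    to_crawl = [url]
    crawled = []
    seen = {url}  # every URL that has already been queued or crawled
    for _ in range(depth + 1):  # same number of passes as `while i <= depth`
        interim = []
        for page in to_crawl:
            crawled.append(page)
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    interim.append(target)
        to_crawl = interim
    return crawled


# Made-up link graph, for illustration only.
links = {
    "https://index.html": ["https://a.html", "https://index.html"],
    "https://a.html": ["https://index.html", "https://b.html"],
}
print(crawl_web("https://index.html", 1, links))
# ['https://index.html', 'https://a.html']
```

The membership test is applied to newly discovered links rather than to the item currently being iterated, which is what lets it filter out duplicates.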