As a beginning Pythoner I don't understand why I get an infinite loop with while?

Question

This code always ends up in an infinite loop in the while:

pos1 = 0
pos2 = 0
url_string = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''
i = int(len(url_string))
#print i  # debug
while i > 0:
    pos1 = int(url_string.find('>'))
    #print pos1 # debug
    pos2 = int(url_string.find('<', pos1))
    #print pos2  # debug
    url_string = url_string[pos2:]
    #print url_string  # debug
    print int(len(url_string))  # debug
    i =  int(len(url_string))

I tried everything without results.

More info:

Python 2.7.5+ (default, Sep 19 2013, 13:48:49)
[GCC 4.8.1] on linux2
Ubuntu 13.10
Run in GNOME Terminal 3.6.1 (also tried in Emacs and PyCharm, with the same infinite loop)

what's your debug output? it must be a big hint. `print url_string` — Karoly Horvath, Feb 08 '14 at 12:50
note: no need to `int` cast and an html code is not a "url". — Karoly Horvath, Feb 08 '14 at 12:56

score 3 · Answer 1 · edited Feb 10 '14 at 05:57

pos1 = int(url_string.find('>'))
pos2 = int(url_string.find('<', pos1))

You're finding the first < that occurs after the first >. There won't always be a < after the first >. When find can't find a <, it'll return -1, and the following:

url_string = url_string[pos2:]

will use url_string[-1:], a slice consisting of the last character of url_string. At that point, Python keeps looping, not finding <, and taking the last character of url_string until you get bored and hit Ctrl+C.
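
The stall can be reproduced in isolation. This is a minimal sketch, assuming the loop has already trimmed the string down to its final `'</p>'`; the names `tail` and `stuck` are made up for illustration, and the parenthesised `print` is only there so the snippet runs on both Python 2.7 and Python 3:

```python
tail = '</p>'                    # what is left of url_string late in the loop

pos1 = tail.find('>')            # 3: the '>' is the last character
pos2 = tail.find('<', pos1)      # -1: there is no '<' at or after index 3
tail = tail[pos2:]               # tail[-1:] keeps only the last character

# From here on nothing changes: find() returns -1 again and
# '>'[-1:] is still '>', so the length never drops below 1.
stuck = tail[tail.find('<', tail.find('>')):]

print(repr(tail))                # '>'
print(repr(stuck))               # '>'
```

Because the length stays at 1, the condition `i > 0` never becomes false.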

It's not clear what the fix is, as it's not clear what you're even trying to do. You might use while i > 1; or you might switch > and < in the computation of pos1 and pos2, and use url_string = url_string[pos2+1:]; or you might do something else. It depends on the goal you're trying to achieve.
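
As one possible reading of the goal (collecting the text that sits outside the tags), the swapped search order could be sketched like this. The function name `text_outside_tags` and the explicit `-1` checks are assumptions added for illustration, not something the question asked for:

```python
def text_outside_tags(s):
    """Collect the text between tags, stopping when no further tag is found."""
    pieces = []
    while s:
        pos1 = s.find('<')           # start of the next tag
        if pos1 == -1:               # no more tags: the rest is plain text
            pieces.append(s)
            break
        pieces.append(s[:pos1])      # text in front of the tag
        pos2 = s.find('>', pos1)     # end of that tag
        if pos2 == -1:               # unterminated tag: stop instead of looping
            break
        s = s[pos2 + 1:]             # drop everything up to and including '>'
    return ''.join(pieces)


url_string = '<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'
print(text_outside_tags(url_string))  # Daily News This is the daily news.end
```

Checking for `-1` right after each `find` means the string is never sliced with a negative index, so it shrinks on every pass or the loop exits.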

score 0 · Answer 2 · answered Feb 08 '14 at 13:22

As pointed out above by @user2357112 you are never getting past the end of your string.

There are a few solutions, but one simple one (based on not really knowing what you are trying to achieve) would be to include the knowledge of pos1 and pos2 in your loop.

while i > 0 and pos1 >= 0 and pos2 >= 0:

If either of the characters you are looking for isn't found, then the loop will stop.
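
A rough sketch of the question's loop with that condition in place. The list `lengths` is an addition for illustration (it replaces the debug `print` inside the loop), and the redundant `int` casts are dropped:

```python
url_string = '<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'
pos1 = 0
pos2 = 0
i = len(url_string)
lengths = []                      # collected instead of printed on every pass

while i > 0 and pos1 >= 0 and pos2 >= 0:
    pos1 = url_string.find('>')
    pos2 = url_string.find('<', pos1)
    url_string = url_string[pos2:]
    i = len(url_string)
    lengths.append(i)

print(lengths)                    # [45, 40, 14, 10, 4, 1]
```

Note that the condition is only checked at the top of the loop, so the last pass still slices with `pos2 == -1` and leaves `'>'` in `url_string` before the loop stops.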

score 0 · Answer 3 · edited Feb 08 '14 at 13:42

It is just easier to split the string and count the number of letters like so:

map(len, url_string.split('<')) # This equals [0, 14, 4, 25, 3, 5, 3]

That's not what you want. You want the cumulative sum of this list. Get it like this:

import numpy as np
lens = np.cumsum( map(len, url_string.split('<')) )

Now we are not quite there yet. You also need to add back the '<' characters that were dropped when the string was split on them. Like so:

lens = lens + np.arange(len(lens))

This should work for single character splits.
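
To check the arithmetic without NumPy, here is a small sketch that swaps a plain running sum in for `np.cumsum`. The names `pieces`, `positions` and `total` are made up for illustration:

```python
url_string = '<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'
pieces = url_string.split('<')

positions = []
total = 0
for index, piece in enumerate(pieces):
    total += len(piece)                # running (cumulative) length of the pieces
    positions.append(total + index)    # add back one '<' per earlier split

print(positions)                       # [0, 15, 20, 46, 50, 56, 60]

# Every entry except the last is the index of a '<'; the last is the string length.
direct = [i for i, ch in enumerate(url_string) if ch == '<']
print(direct == positions[:-1])        # True
```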

Edit

As pointed out the requirement was to just extract the stuff which is not part of the tags. Then the one liner ...

''.join( map(lambda x: x.split('>')[-1] ,  url_string.split('<')) )

should do the job. Thanks for pointing that out!

edited Feb 08 '14 at 13:42

answered Feb 08 '14 at 13:22

ssm


For multi-character separators, you will need to modify the last line to `lens = lens + len(sep) * np.arange(len(lens))`, where `sep` is the string you split on. – ssm Feb 08 '14 at 13:25
Doesn't this also include all the tags? I do believe the idea was to output everything that is not a tag. – Peter Abolins Feb 08 '14 at 13:26
These are the positions of the tags. I think I misunderstood the question. Gimme a sec. I'll look at the code again ... – ssm Feb 08 '14 at 13:35
In that case, the one liner `''.join( map(lambda x: x.split('>')[-1] , url_string.split('<')) )` should do it. – ssm Feb 08 '14 at 13:40

score 0 · Accepted Answer · answered Feb 08 '14 at 13:29

It looks like you're trying to parse HTML to get data out of elements (e.g. you want the data inside the h1 tags, like 'Daily News '). If this is the case, I recommend using a library called BeautifulSoup4; see the quick start at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
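
To give a feel for parser-based extraction without installing anything, here is a rough sketch that uses the standard library's `HTMLParser` in place of BeautifulSoup (so this is not the BeautifulSoup API). The class name `TextCollector` and its attributes are hypothetical, and the `try`/`except` import is only there so it runs on both Python 2.7 and Python 3:

```python
try:
    from html.parser import HTMLParser    # Python 3
except ImportError:
    from HTMLParser import HTMLParser     # Python 2.7


class TextCollector(HTMLParser):
    """Record (tag, text) pairs for text found inside an open tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.by_tag = []       # collected (tag name, text) pairs
        self._open = []        # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            self.by_tag.append((self._open[-1], data))


url_string = '<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'
parser = TextCollector()
parser.feed(url_string)
parser.close()
print(parser.by_tag)
```

For this input, `by_tag` should hold one pair for the `h1` text and one for each `p`, with no manual index bookkeeping.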

That said, since I'm not exactly sure what the program is meant to do, I broke down your code so that it's hopefully easier for you to see what's going on with the variables (and for now, took out the while loop). This will let you see exactly what your code has done without it running into an infinite loop.

# Setup Variables
pos1 = 0
pos2 = 0
url_string = '''<h1>Daily News </h1><p>This is the daily news.</p><p>end</p>'''
i = int(len(url_string)) # the url_string length is 60 characters
print "Setting up Variables with string at ", i, " characters"
print "String is: ", url_string

"""string.find(s, sub[, start[, end]])
Return the lowest index in s where the substring sub is found such that sub is 
wholly contained in s[start:end]. Return -1 on failure. Defaults for start and 
end and interpretation of negative values is the same as for slices.

Source: http://docs.python.org/2/library/string.html
"""

print "Running through program first time"
pos1 = int(url_string.find('>'))
# This finds the first occurrence of '>', which is at position 3

pos2 = int(url_string.find('<', pos1))
# This finds the first occurrence of '<' after position 3 ('>'),
# which is at position 15
print "Pos1 is at:", pos1, " and pos2 is at:", pos2

url_string = url_string[pos2:] # trimming string down?
print "The string is now: ", url_string
# </h1><p>This is the daily news.</p><p>end</p>

print "The string length is now: ", int(len(url_string)) # string length now 45
i = int(len(url_string)) # updating the length var to the new length

This is what it looks like in the terminal: [screenshot: Running program]
