Match port number

Question

I am trying to match port numbers inside <span> tags on an HTML page:

<span class="tbBottomLine" style="width:50px;">
                8080
        </span>
<span class = "tbBottomLine" style = "width: 50px;">
            80
    </ span>
<span class = "tbBottomLine" style = "width: 50px;">
            3124
    </ span>
<span class = "tbBottomLine" style = "width: 50px;">
            1142
    </ span>

Script:

import urllib2
import re

h = urllib2.urlopen('http://www.proxy360.cn/Region/Brazil')

html = h.read()

parser_port = '<span.*>\s*([0-9]){2,}\s*</span>'

p = re.compile(parser_port)

list_port = p.findall(html)

print list_port

But I'm getting this output:

['8', '8', '0', '0', '0', '8', '8', '0', '0', '8', '8', '8', '8', '8', '8', '8', '8', '0']

I need it to match the whole number instead, for example 8080.

And what is the end result you're looking for? Just the 8080? — LPChip, Nov 19 '15 at 09:44
I'm sorry, you're still not making any sense. Am I correct to understand that after executing the regex, it would just find the 8080, 80, 3124 and 1142 as in your example? or does it also need to contain more? — LPChip, Nov 19 '15 at 09:55
This is closer to what you are after [`>\s*([0-9]){2,}\s*<`](http://regexr.com/3c8at) — Burgi, Nov 19 '15 at 10:29
You have completely changed your question now. It is still not clear what you are trying to achieve. — Burgi, Nov 19 '15 at 11:39
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — , May 05 '17 at 04:24
[*Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.* - Jamie Zawinski](http://regex.info/blog/2006-09-15/247) — , May 05 '17 at 04:25

score 0 · Answer 1 · answered Nov 19 '15 at 12:10

If you're looking to pull the ports off the page, try:

parser_port = '<span.*>\s*([0-9]{2,})+\s*</span>'

You want one or more repetitions (the + sign) of a run of at least two digits (the {2,}). It is still unclear what the use case is, though.
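As a rough check of how this pattern behaves, the sketch below runs it with and without the trailing + on a small sample. The sample string is hypothetical: it is modelled on the question's first snippet and is not the live page.

```python
import re

# Hypothetical sample modelled on the question's first snippet, with
# closing tags written as </span> (no space), which this pattern requires.
sample = (
    '<span class="tbBottomLine" style="width:50px;">\n'
    '        8080\n'
    '</span>\n'
    '<span class="tbBottomLine" style="width:50px;">\n'
    '        80\n'
    '</span>\n'
)

# The answer's pattern, and the same pattern without the trailing +.
with_plus = re.findall(r'<span.*>\s*([0-9]{2,})+\s*</span>', sample)
without_plus = re.findall(r'<span.*>\s*([0-9]{2,})\s*</span>', sample)

print(with_plus)
print(without_plus)
```

On this sample both lists come out as ['8080', '80']: the {2,} inside the group already captures the whole run of digits, so the trailing + does not appear to change the result here. Note that the literal </span> in this pattern will not match closing tags written as </ span>.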

Andrew Sledge

Mariano · Accepted Answer · 2015-11-19T12:48:51.390

You are repeating the group in ([0-9]){2,}, so its captured value is overwritten on every repetition and only the last digit is kept.

Instead, repeat the subpattern inside the group:

<span[^>]*>\s*([0-9]{2,})\s*</\s*span>

Code

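# Raw string, so the backslashes reach the regex engine unchanged
parser_port = r'<span[^>]*>\s*([0-9]{2,})\s*</\s*span>'
p = re.compile(parser_port)

# html is the page source read in the question's script
list_port = p.findall(html)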
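To make the difference concrete, here is a minimal sketch that runs both forms of the group against a small sample. The html string is an assumption for illustration: it is modelled on the question's snippet and is not fetched from the live page.

```python
import re

# Sample markup modelled on the question's snippet (not the live page).
html = """
<span class="tbBottomLine" style="width:50px;">
                8080
        </span>
<span class = "tbBottomLine" style = "width: 50px;">
            80
    </ span>
"""

# Quantifier outside the group: the group matches one digit at a time and
# is overwritten on every repetition, so only the last digit survives.
repeated_group = re.compile(r'<span[^>]*>\s*([0-9]){2,}\s*</\s*span>')

# Quantifier inside the group: the whole run of digits is captured.
whole_number = re.compile(r'<span[^>]*>\s*([0-9]{2,})\s*</\s*span>')

last_digits = repeated_group.findall(html)
ports = whole_number.findall(html)

print(last_digits)
print(ports)
```

With this sample the first pattern yields ['0', '0'] and the second yields ['8080', '80']. The </\s*span> part is what lets the second closing tag, written as </ span>, match as well.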

edited Nov 19 '15 at 12:48

answered Nov 19 '15 at 12:18
