
I am trying to match port numbers in <span> tags from an html page:

<span class="tbBottomLine" style="width:50px;">
                8080
        </span>
<span class = "tbBottomLine" style = "width: 50px;">
            80
    </ span>
<span class = "tbBottomLine" style = "width: 50px;">
            3124
    </ span>
<span class = "tbBottomLine" style = "width: 50px;">
            1142
    </ span>

Script:

import urllib2
import re

h = urllib2.urlopen('http://www.proxy360.cn/Region/Brazil')

html = h.read()

parser_port = '<span.*>\s*([0-9]){2,}\s*</span>'

p = re.compile(parser_port)

list_port = p.findall(html)

print list_port

But I'm getting this output:

['8', '8', '0', '0', '0', '8', '8', '0', '0', '8', '8', '8', '8', '8', '8', '8', '8', '0']

And I need it to match 8080 for example.

Mariano
Mr.Junsu

2 Answers

If you're looking to pull the ports off the page, try:

parser_port = '<span.*>\s*([0-9]{2,})+\s*</span>'

You want one or more matches of the group (the + sign), where each match is a run of at least two digits (the {2,}). It's still unclear what the use case is, though.
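
As a quick check, here is a minimal sketch that runs this pattern against the first sample span from the question. The variable names (sample, pattern, ports) are made up for illustration, and the markup is pasted in as a string instead of being fetched from the site. Note that the closing tag here is written as </span>; the "</ span>" variants in the question are not covered by this pattern.

```python
import re

# First sample span from the question, pasted in as a plain string.
sample = (
    '<span class="tbBottomLine" style="width:50px;">\n'
    '                8080\n'
    '        </span>'
)

# Same pattern as above, written as a raw string.
pattern = re.compile(r'<span.*>\s*([0-9]{2,})+\s*</span>')

# findall returns the contents of the single capture group for each match.
ports = pattern.findall(sample)
print(ports)  # ['8080']
```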

Andrew Sledge

You are repeating the group itself, ([0-9]){2,}, so on every repetition the captured value is overwritten and only the last digit is kept.

Instead, repeat the subpattern inside the group:

<span[^>]*>\s*([0-9]{2,})\s*</\s*span>

Code

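# Raw string, so the backslashes reach the regex engine unchanged
parser_port = r'<span[^>]*>\s*([0-9]{2,})\s*</\s*span>'
p = re.compile(parser_port)

# html is the page source read in the question's script
list_port = p.findall(html)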
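
To see the difference side by side, here is a small sketch that runs both forms against sample markup copied from the question, not the live page. The names repeated_group, repeated_inside, last_digits and ports are only for illustration.

```python
import re

# Sample markup copied from the question, including the "</ span>" variant.
html = '''<span class="tbBottomLine" style="width:50px;">
                8080
        </span>
<span class = "tbBottomLine" style = "width: 50px;">
            80
    </ span>
<span class = "tbBottomLine" style = "width: 50px;">
            3124
    </ span>'''

# Quantifier outside the group: the group is re-captured for every digit,
# so only the last digit of each number survives.
repeated_group = re.compile(r'<span[^>]*>\s*([0-9]){2,}\s*</\s*span>')

# Quantifier inside the group: the group captures the whole run of digits.
repeated_inside = re.compile(r'<span[^>]*>\s*([0-9]{2,})\s*</\s*span>')

last_digits = repeated_group.findall(html)
ports = repeated_inside.findall(html)

print(last_digits)  # ['0', '0', '4']
print(ports)        # ['8080', '80', '3124']
```

If you need numbers instead of strings, converting with int() on each item of ports should be enough.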
Mariano