
I have written a regex to scrape data from a web page, but I am getting the mentioned error and have not been able to find a solution. Someone suggested wrapping the code like this:

try:
    ...  # scraping code
except AttributeError:
    ...  # handle the error

Original Code:

import urllib.request
import bs4
import re

url ='https://ipinfo.io/AS7018'
def url_to_soup(url):
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup


s = str(url_to_soup(url))
#print(s)
asn_code, name = re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>', s)\
        .groups() # Error code
print(asn_code)
""" This is where the error : From above code """
country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>',s).group("COUNTRY")
print(country)
registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>',s, re.S).group("REGISTRY").strip()
print(registry)
# The re.S flag makes the '.' special character match any character at all, including a newline.
ip = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>',s, re.S).group("IP").strip()
print(ip)

2 Answers

The statement:

re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>', s)

is returning None because the pattern you're looking for has not been found in the string s.

According to the documentation for re.search:

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

You have to redesign your regex or debug your code in order to find out what s contains by the time the mentioned pattern is used.
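
One way to do that debugging is sketched below. The string s here is a hypothetical stand-in for the page source (the real markup may differ); the idea is only to confirm that the match is None and to print the text near the heading so you can see which characters the pattern cannot handle.

```python
import re

# Hypothetical stand-in for the page source; the real markup may differ.
s = '<h3 class="font-semibold m-0 t-xs-24">AS7018 AT&amp;T Services, Inc.</h3>'

pattern = re.compile(
    r'<h3 class="font-semibold m-0 t-xs-24">'
    r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>'
)

match = pattern.search(s)
snippet = None
if match is None:
    # No match: show the text around the first <h3> to spot the offending characters.
    start = s.find('<h3')
    snippet = s[start:start + 120] if start != -1 else '<no h3 found>'
    print('no match; nearby text:', snippet)
else:
    print(match.groups())
```

With this sample the search fails, because the character class [\w.\s] allows neither "&" nor ";" nor ",".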

Raydel Miranda

re.search returns None when it fails to find anything. None does not respond to the method .groups(). Check whether a match exists or not before you inspect the match in detail.

match = re.search(r'<h3 class="font-semibold m-0 t-xs-24">(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)</h3>', s)
if match:
    asn_code, name = match.groups()

However, since you're using Beautiful Soup, why stringify and then regex match? It's like buying a packet of instant soup, adding the powder to the water, boiling the thing, then dehydrating it back to powder. Why even use BeautifulSoup then?

soup.select('h3.font-semibold.m-0.t-xs-24')[0].text

will give you the content of that <h3> element; then apply regex on that, if you need to. Regexping through HTML documents is generally a bad idea.

EDIT: What exactly gives you TypeError? This is a typical XY problem, where you're solving the wrong thing. I verified this to work, with no TypeError (Python 3):

asn_re = re.compile(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s]+)')
soup = url_to_soup(url)
# Take the first <h3> whose text matches the pattern, or None if there is none.
asn_h3 = next(
    (m for m in (asn_re.match(h3.text) for h3 in soup.select('h3')) if m),
    None)
if asn_h3:
    asn_code, name = asn_h3.groups()
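
On the TypeError: one plausible cause (an assumption, since the failing call is not shown) is passing the parsed element itself to the regex instead of its text. A minimal sketch, using a hypothetical FakeTag class in place of a real bs4 element and made-up heading text:

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed element; not a real bs4 class."""

    def __init__(self, text):
        self.text = text


h3 = FakeTag('AS7018 Example Network')  # made-up heading text

error = None
try:
    # The re functions only accept str or bytes, so passing the object fails.
    re.match(r'AS\d+', h3)
except TypeError as exc:
    error = str(exc)

# Passing the element's text works without stringifying the whole document.
m = re.match(r'AS\d+', h3.text)
print(error)
print(m.group())
```

If that is what happened, matching against h3.text avoids the error.
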
Amadan
  • Thanks for the solution. It works partly; I guess there is still some issue with the regex that I'm not able to fix. To answer the second part, i.e. why I am stringifying the bs4 object: it's because otherwise I get a "TypeError: expected string or bytes-like object" error. In any case, can you please help with the regex? I believe the regex rejects the line when it contains &amp, which is what I am trying to overcome. –  Jan 15 '19 at 06:42
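
A possible way around the &amp problem raised in this comment, sketched under assumptions: the sample string below is hypothetical, and asn_re is just an illustrative name. The name group is widened to accept anything up to the next tag, and html.unescape from the standard library turns the entity back into a plain "&".

```python
import html
import re

# Hypothetical stand-in for the stringified page; the real markup may differ.
s = '<h3 class="font-semibold m-0 t-xs-24">AS7018 AT&amp;T Services, Inc.</h3>'

# [^<]+ accepts any character up to the closing tag, including '&', ';' and ','.
asn_re = re.compile(r'<h3[^>]*>(?P<ASN_CODE>AS\d+) (?P<NAME>[^<]+)</h3>')

asn_code = name = None
match = asn_re.search(s)
if match:
    asn_code = match.group('ASN_CODE')
    # Decode HTML entities such as &amp; in the captured name.
    name = html.unescape(match.group('NAME')).strip()

print(asn_code, name)
```

The guard on match keeps the code from raising AttributeError when the heading is missing.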