I cannot seem to handle blank results from regex (re.search) in Python: I get either duplicates or no results

Question

I am trying to pull a list of individuals from https://www.ourcommons.ca/Parliamentarians/en/members?view=List. Once I have the list, I go through each member's link and try to find their email address.

Some of the members don't have an email address, which causes the code to fail. I tried adding a branch for the case where the match result is None, and in that case I get duplicate results.

I am using the following logic for matching:

mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca', ln1.get('href'))
if mat:
    email.append(mat.group())
else:
    email.append("No Email Found")

The if condition is where the issue is. When I use the else branch, it gives "No Email Found" once for every row.

import re

import requests
from bs4 import BeautifulSoup

weblinks = []
email = []

page = requests.get('https://www.ourcommons.ca/Parliamentarians/en/members?view=ListAll')
soup = BeautifulSoup(page.content, 'lxml')


for ln in soup.select(".personName > a"):
    weblinks.append("https://www.ourcommons.ca" + ln.get('href'))
    if(len(weblinks)==10):
        break

# extract emails

for elnk in weblinks:
    pagedet = requests.get(elnk)
    soupdet = BeautifulSoup(pagedet.content, 'lxml')
    for ln1 in soupdet.select(".caucus > a"):
        mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca',ln1.get('href'))
        if mat:
            email.append(mat.group())
        else:
            email.append("No Email Found")

print("Len Email:",len(email))

Expected result: show the email for each page that has one, and a blank for each page that doesn't.

Your code seems to work for me. What versions of Python and beautifulsoup are you using? — Matt Pitkin, Sep 18 '19 at 11:33
What do you mean duplicate results? Does it mean you are getting two of the same emails when it's a match and two `"No Email Found"` when a match isn't found? — r.ook, Sep 18 '19 at 12:33

score 0 · Answer 1 · answered Sep 18 '19 at 13:47

If you check the page DOM, some pages have two similar elements (two links under `.caucus`), which is why you are getting multiple values. You need to add a condition to filter out the extra one. Try the code below.

import re

import requests
from bs4 import BeautifulSoup

weblinks = []
email = []

page = requests.get('https://www.ourcommons.ca/Parliamentarians/en/members?view=ListAll')
soup = BeautifulSoup(page.content, 'lxml')


for ln in soup.select(".personName > a"):
    weblinks.append("https://www.ourcommons.ca" + ln.get('href'))
    if(len(weblinks)==10):
        break


for elnk in weblinks:
    pagedet = requests.get(elnk)
    soupdet = BeautifulSoup(pagedet.content, 'lxml')
    if len(soupdet.select(".caucus > a")) > 1:
        # More than one link: keep only children that are not a[target]
        for ln1 in soupdet.select(".caucus > :not(a[target])"):
            mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca', ln1.get('href'))
            if mat:
                email.append(mat.group())
            else:
                email.append("No Email Found")
    else:
        # Zero or one link: check it directly
        for ln1 in soupdet.select(".caucus > a"):
            mat = re.search(r'mailto:\w*\.\w*@parl.gc.ca', ln1.get('href'))
            if mat:
                email.append(mat.group())
            else:
                email.append("No Email Found")

print(email)
print("Len Email:",len(email))

Output:

['mailto:Ziad.Aboultaif@parl.gc.ca', 'mailto:Dan.Albas@parl.gc.ca', 'mailto:harold.albrecht@parl.gc.ca', 'mailto:John.Aldag@parl.gc.ca', 'mailto:Omar.Alghabra@parl.gc.ca', 'mailto:Leona.Alleslev@parl.gc.ca', 'mailto:dean.allison@parl.gc.ca', 'No Email Found', 'No Email Found', 'mailto:Gary.Anand@parl.gc.ca']

Len Email: 10
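
Another way to get exactly one entry per member page, however many links sit under `.caucus`, might be to look at all the hrefs of a page first and append only once per page. The following is a minimal sketch rather than a tested scraper: `first_email` and `EMAIL_RE` are hypothetical names, the hrefs are passed in as plain strings (simulated data) so it runs without network access, and the pattern is a slightly tightened variant (escaped dots, hyphens allowed in names) that I am assuming fits the addresses on the site.

```python
import re

# Assumed pattern: dots escaped, hyphens allowed in the name parts.
EMAIL_RE = re.compile(r"mailto:[\w-]+\.[\w-]+@parl\.gc\.ca")


def first_email(hrefs, default="No Email Found"):
    """Return the first matching mailto link among hrefs, else default."""
    for href in hrefs:
        # Tag.get('href') returns None when the attribute is missing,
        # and re.search would raise TypeError on None.
        match = EMAIL_RE.search(href or "")
        if match:
            return match.group()
    return default


# Simulated data: each inner list stands for the hrefs found on one page.
pages = [
    ["https://example.org/party", "mailto:Dan.Albas@parl.gc.ca"],
    ["https://example.org/party", None],
    [],
]
email = [first_email(hrefs) for hrefs in pages]
print(email)
print("Len Email:", len(email))
```

In the scraping loop this would be called as `email.append(first_email(a.get('href') for a in soupdet.select('.caucus > a')))`, so a page with two links, one link, or none still contributes a single entry.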
