python re.findall returns a list of tuples (strings are expected)

Question

re.findall returns a list of tuples that contain the expected strings plus something unexpected.

I was writing a function findtags(text) to find tags in a given paragraph of text. When I call re.findall(tags, text) to find the tags, it returns a list of tuples. Each tuple contains the string I expected, along with an extra element.

The function findtags(text) is as follows:

import re

def findtags(text):
    parms = '(\w+\s*=\s*"[^"]*"\s*)*'
    tags = '(<\s*\w+\s*' + parms + '\s*/?>)'
    print(re.findall(tags, text))
    return re.findall(tags, text)

testtext1 = """
My favorite website in the world is probably 
<a href="www.udacity.com">Udacity</a>. If you want 
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)

The expected result is

['<a href="www.udacity.com">', 
 '<b>', 
 '<a href="www.udacity.com"target="_blank">']

The actual result is

[('<a href="www.udacity.com">', 'href="www.udacity.com"'), 
 ('<b>', ''), 
 ('<a href="www.udacity.com"target="_blank">', 'target="_blank"')]

score 1 · Accepted Answer · answered Oct 03 '19 at 14:53

re.findall returns a list of tuples because your pattern has two capturing groups. Make the parms group non-capturing with (?:...):

import re

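def findtags(text):
    # Make the attribute group non-capturing with (?:...) so that only the
    # outer group (the whole tag) is captured. Raw strings keep \w and \s
    # from being treated as string escapes.
    parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    # Run the search once and reuse the result.
    found = re.findall(tags, text)
    print(found)
    return found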

testtext1 = """
My favorite website in the world is probably 
<a href="www.udacity.com">Udacity</a>. If you want 
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""

findtags(testtext1)

OUTPUT:

['<a href="www.udacity.com">', '<b>', '<a href="www.udacity.com"target="_blank">']

Another way: if the pattern contains no capturing group at all, re.findall returns the full matched text:

# non-capturing group
parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
# no group at all around the whole tag
tags = r'<\s*\w+\s*' + parms + r'\s*/?>'
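
A third option, offered here only as a sketch and not taken from the answer above, is to leave the capturing groups alone and use re.finditer, reading group(0) from each match. group(0) is always the entire match, no matter how many groups the pattern has. The sample string is made up for illustration.

```python
import re

def findtags(text):
    # Same pattern as in the question; both capturing groups are kept.
    parms = r'(\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    # group(0) is the whole match, independent of the capturing groups.
    return [m.group(0) for m in re.finditer(tags, text)]

# Hypothetical sample text, shortened from the question's example.
sample = 'write <a href="www.udacity.com"target="_blank">Udacity</a> in <b>bold</b>'
result = findtags(sample)
print(result)
```

This can be handy when the inner group is still needed elsewhere, since each match object also exposes m.group(1) and m.group(2).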

score 0 · Answer 2 · answered Oct 03 '19 at 14:24

According to the docs for re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

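In your case, the pattern contains two groups: the outer parentheses in tags and the repeated group in parms = '(\w+\s*=\s*"[^"]*"\s*)*'. Each match therefore produces a tuple, and the second element holds only the last repetition of the inner group, or an empty string when the tag has no attributes.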
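
To make the documented rule concrete, here is a minimal sketch comparing the three cases (no group, one group, two groups). The patterns are simplified versions of the one in the question and the input string is invented for the demonstration.

```python
import re

text = '<a href="x"target="_blank"> <b>'

# No capturing groups: findall returns each whole match as a string.
no_groups = re.findall(r'<\w+\s*(?:\w+="[^"]*"\s*)*>', text)

# One capturing group: findall returns that group's text as a string.
one_group = re.findall(r'(<\w+\s*(?:\w+="[^"]*"\s*)*>)', text)

# Two capturing groups: findall returns one tuple per match. The repeated
# inner group keeps only its last repetition, or '' if it never matched.
two_groups = re.findall(r'(<\w+\s*(\w+="[^"]*"\s*)*>)', text)

print(no_groups)
print(one_group)
print(two_groups)
```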

score 0 · Answer 3 · answered Oct 03 '19 at 14:24

Looks like you don't want to return your inner capture group matches, so make it a non-capturing group instead.

parms = '(?:\w+\s*=\s*"[^"]*"\s*)*'
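
One quick way to check how many capturing groups a pattern has, and so whether re.findall will return strings or tuples, is the groups attribute of a compiled pattern. This is a small sketch; the variable names are chosen here for illustration.

```python
import re

# Pattern from the question: outer group plus a capturing inner group.
capturing = re.compile(r'(<\s*\w+\s*(\w+\s*=\s*"[^"]*"\s*)*\s*/?>)')
# Same pattern with the inner group made non-capturing.
non_capturing = re.compile(r'(<\s*\w+\s*(?:\w+\s*=\s*"[^"]*"\s*)*\s*/?>)')

# Pattern.groups is the number of capturing groups in the pattern.
print(capturing.groups)
print(non_capturing.groups)
```

With more than one capturing group, re.findall yields tuples; with zero or one, it yields plain strings.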

benvc
