My Python regex (using the re module) returns a list of tuples where I expect a simple list of strings.

I am scraping over 100 web pages with Beautiful Soup to obtain software version information that is returned in a JavaScript array. The script loops over a list of IP addresses and logs the version info to a file.

Example source:

<script language="Javascript">
var active='2';
Appname[0]='foobar';
Appversion[0]='6.02 TEST prod';
Appversion[0]='6.02 TEST ';
Appname[1]='barfoo';
Appversion[1]='06.0001.00';
Appversion[1]='06.0001.00';
Appname[1]='barfoo';
Appversion[1]='06.0001.00';
Appversion[1]='06.0001.00';
</script>

As you can see, the second array element is usually assigned twice, and sometimes four times (but not on every instance of the retrieved web page). I would simply like to take the last value assigned. The value of active is one more than the index of the required array element. Other arrays and variables are omitted for clarity.

Desired output for the example:

<ip addr>  1  barfoo  06.0001.00

My code is based on this post and this:

jscript = soup.find("script", text=lambda text: "var Appversion" in text)
# following lookaround regex; match but discard active=' and '; to return only the assigned value
regex = r"(?<=active=')((.*)(?=';))"
pattern = re.compile(regex)
active = pattern.search(jscript.get_text(), re.MULTILINE)
index = int(active.group())-1
pstring = pstring + '\t' + str(index)
array_names = ["Appname", "Appversion"]
for name in array_names:
    regex=r"(?<=" + name + r"\[" + str(index) + r"\]=')((.*)(?=';))"
    pattern = re.compile(regex)
    name_match = pattern.findall(jscript.get_text(), re.MULTILINE)
    last_index = len(name_match)-1
    # no idea why these are tuples; print to see what we have
    print(name_match)
    # raises IndexError for Appversion in the example
    pstring = pstring + '\t' + name_match[last_index][last_index]

The output from print(name_match) is:

[('barfoo', 'barfoo'), ('barfoo', 'barfoo')]
[('06.0001.00', '06.0001.00'), ('06.0001.00', '06.0001.00'), ('06.0001.00', '06.0001.00'), ('06.0001.00', '06.0001.00')]

I thought a regex would do the job: run findall and then take the last element, name_match[last_index]. But each match is a tuple, and I do not understand why.

name_match[last_index][last_index] works for Appname (by luck, I think), but not for Appversion. Why does the regex return tuples? How do I make findall return a simple list of matches, or otherwise code this correctly? I think my lookaround regex is OK (the same format works fine to extract the value of active). This is my first time using re, and I think of it as grep.
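
For reference, the tuple behavior can be reproduced without Beautiful Soup. The following is a minimal sketch, assuming a hard-coded string in place of the scraped page (the `sample` variable and the three result names are illustrative, not part of the original script). `re.findall` returns a list of strings when the pattern has zero or one capturing group, and a list of tuples when it has more than one. The pattern `((.*)(?=';))` contains two groups, one nested inside the other, so each match comes back as a pair.

```python
import re

# Hypothetical sample mirroring two assignments from the example source.
sample = "Appname[1]='barfoo';\nAppname[1]='barfoo';\n"

# Two capturing groups (outer and inner): findall yields one tuple per match.
nested = re.findall(r"(?<=Appname\[1\]=')((.*)(?=';))", sample)

# One capturing group: findall yields a plain list of strings.
single = re.findall(r"(?<=Appname\[1\]=')(.*)(?=';)", sample)

# No capturing groups: findall yields the whole match for each hit.
no_group = re.findall(r"(?<=Appname\[1\]=').*(?=';)", sample)

print(nested)    # [('barfoo', 'barfoo'), ('barfoo', 'barfoo')]
print(single)    # ['barfoo', 'barfoo']
print(no_group)  # ['barfoo', 'barfoo']
```

This also suggests why `name_match[last_index][last_index]` appears to work for Appname: with two matches, `last_index` is 1, which happens to be a valid index into a two-element tuple. With four matches for Appversion, `last_index` is 3, which is out of range for the tuple.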

gloopy
  • Remove all unnecessary capturing group parentheses, or convert all capturing groups to non-capturing. – Wiktor Stribiżew Apr 11 '18 at 06:48
  • @WiktorStribiżew I don't see that this is a duplicate; the accepted answer on [that post](https://stackoverflow.com/questions/31915018/python-re-findall-behaves-weird) did not change the regex grouping, although your solution did. But your comment above is different and helpful. When I remove the "group of groups" from the regex, it works: `regex=r"(?<=" + name + r"\[" + str(index) + r"\]=')(.*)(?=';)"`. Thanks for that. – gloopy Apr 11 '18 at 07:38
  • @WiktorStribiżew as I understand it, the problem here is that the group of groups used by findall in my original code caused a second instance of the match which was output as a tuple of identical matches. – gloopy Apr 11 '18 at 07:43
  • This is marked as a dupe because you asked why re.findall returns a tuple, and [my answer](https://stackoverflow.com/a/31915134/3832970) explains that well enough. The solution is natural: get rid of the unnecessary capturing groups by either removing redundant ones or by converting those that group subpatterns into non-capturing ones. Do you think I need to add it to my answer there? – Wiktor Stribiżew Apr 11 '18 at 08:03
  • No one will read a post that says "my re.findall returns unexpected null results" and think it's the same as "my re.findall returns duplicates of the expected match in a tuple". Sure, it might be the same underlying cause, but I wouldn't say that's sufficient. I did not find your answer useful except that it links to the re.findall documentation. What I did find useful is your comment above to remove groups. As it stands I can't mark this as answered or give you any credit, and the accepted answer of the "dupe" is not a solution here. But thanks again for the clue I needed. – gloopy Apr 12 '18 at 00:11
  • I [updated my answer](https://stackoverflow.com/a/31915134/3832970). – Wiktor Stribiżew Apr 12 '18 at 09:52
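
Putting the comments together, one possible way to get the last assigned value is sketched below. It is an illustration under stated assumptions, not the asker's final code: `script_text` stands in for `jscript.get_text()`, the helper name `last_assigned` is hypothetical, and the lookarounds are swapped for a single plain capturing group, which gives the same list-of-strings result from `findall`.

```python
import re

# Hypothetical sample text; in the original script this would come from
# jscript.get_text().
script_text = """
var active='2';
Appname[0]='foobar';
Appversion[0]='6.02 TEST prod';
Appversion[0]='6.02 TEST ';
Appname[1]='barfoo';
Appversion[1]='06.0001.00';
Appversion[1]='06.0001.00';
"""


def last_assigned(text, name, index):
    """Return the last value assigned to name[index], or None if absent."""
    # Exactly one capturing group, so findall returns a plain list of strings.
    pattern = re.compile(re.escape(name) + r"\[" + str(index) + r"\]='(.*)';")
    matches = pattern.findall(text)
    return matches[-1] if matches else None


# active is one more than the required array index.
active = re.search(r"active='(\d+)';", script_text)
index = int(active.group(1)) - 1

values = [last_assigned(script_text, n, index) for n in ("Appname", "Appversion")]
# Missing values become empty strings so the join cannot fail on None.
line = "\t".join([str(index)] + [v or "" for v in values])
print(line)
```

Indexing with `[-1]` avoids computing `last_index` by hand. Separately, the methods of a compiled pattern take a start position as their second positional argument (`pattern.findall(string, pos)`), so passing `re.MULTILINE` there is read as an offset rather than a flag. Flags belong in the `re.compile` call, and this pattern needs none because it uses neither `^` nor `$`.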
