My Python regex using re is returning a list of tuples where I expected a simple list.
I am scraping over 100 web pages using Beautiful Soup to obtain software version information that is returned in a JavaScript array. The script loops over a list of IP addresses, and the version info is logged to a file.
Example source:
<script language="Javascript">
var active='2';
Appname[0]='foobar';
Appversion[0]='6.02 TEST prod';
Appversion[0]='6.02 TEST ';
Appname[1]='barfoo';
Appversion[1]='06.0001.00';
Appversion[1]='06.0001.00';
Appname[1]='barfoo';
Appversion[1]='06.0001.00';
Appversion[1]='06.0001.00';
</script>
As you can see, the second array element is usually assigned twice, and sometimes four times (but not on all instances of the retrieved web page). I would like to simply take the last value assigned. The active variable is one more than the required array element. There are other arrays and variables omitted for clarity.
Desired output for the example:
<ip addr> 1 barfoo 06.0001.00
My code is based on this post and this:
jscript = soup.find("script", text=lambda text: "var Appversion" in text)
# following lookaround regex; match but discard active=' and '; to return only the assigned value
regex = r"(?<=active=')((.*)(?=';))"
pattern = re.compile(regex)
active = pattern.search(jscript.get_text(), re.MULTILINE)
index = int(active.group()) - 1
pstring = pstring + '\t' + str(index)
array_names = ["Appname", "Appversion"]
for name in array_names:
    regex = r"(?<=" + name + r"\[" + str(index) + r"\]=')((.*)(?=';))"
    pattern = re.compile(regex)
    name_match = pattern.findall(jscript.get_text(), re.MULTILINE)
    last_index = len(name_match) - 1
    # no idea why we have tuples, what do we have
    print(name_match)
    # crashes for Appversion in example
    pstring = pstring + '\t' + name_match[last_index][last_index]
The output from print(name_match) is:
[('barfoo', 'barfoo'), ('barfoo', 'barfoo')]
[('06.0001.00', '06.0001.00'), ('06.0001.00', '06.0001.00'), ('06.0001.00', '06.0001.00'), ('06.0001.00', '06.0001.00')]
I thought regex would do the job: do a findall and then take the last index, name_match[last_index]. But each match is a tuple, and I do not understand why this is the case. name_match[last_index][last_index] works for the Appname (by luck, I think), but not for Appversion.
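To make the "by luck" part concrete, here is a minimal sketch that replays the double indexing on the two lists printed above. The helper name last_by_double_index is only for illustration and is not part of my script. It suggests the Appname case works only because the number of matches (2) happens to equal the tuple length (2), while the Appversion case has 4 matches and so indexes past the end of a 2-tuple.

```python
# Lists copied from the print(name_match) output above.
appname_matches = [('barfoo', 'barfoo'), ('barfoo', 'barfoo')]
appversion_matches = [('06.0001.00', '06.0001.00')] * 4


def last_by_double_index(matches):
    """Mimic name_match[last_index][last_index] (hypothetical helper)."""
    last_index = len(matches) - 1
    # The same number is used for the list position and the tuple position.
    return matches[last_index][last_index]


# Two matches: last_index is 1, which is a valid position in a 2-tuple.
print(last_by_double_index(appname_matches))

# Four matches: last_index is 3, which is out of range for a 2-tuple.
try:
    last_by_double_index(appversion_matches)
except IndexError as exc:
    print("IndexError:", exc)

# Indexing from the end does not depend on how many matches there are.
print(appversion_matches[-1][0])
```
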
Why does the regex return a tuple? How do I ensure the regex returns a simple list of matches, or otherwise code this correctly? I think my regex declaration using lookaround is OK (the same format works fine to extract the value of active). This is my first time using re, and I think of it as grep.
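For reference, a possible explanation and a sketch of one way around it. As documented for the re module, findall returns a list of tuples when the pattern contains more than one capturing group, and the pattern above has two: the outer ((...)) and the inner (.*). Dropping the parentheses should give a flat list of strings. Separately, the methods of a compiled pattern take a start position as their second positional argument, not flags, so passing re.MULTILINE there is likely being read as a start offset; the flag is not needed here in any case because it only affects ^ and $. The sample text, the helper name last_assigned, and the use of re.escape are assumptions added for illustration.

```python
import re

# Sample text modelled on the example source above (an assumption, trimmed).
script_text = """
var active='2';
Appname[0]='foobar';
Appversion[0]='6.02 TEST prod';
Appversion[0]='6.02 TEST ';
Appname[1]='barfoo';
Appversion[1]='06.0001.00';
Appversion[1]='06.0001.00';
"""

# Two capturing groups, as in the original pattern: findall yields tuples.
grouped = re.findall(r"(?<=Appname\[1\]=')((.*)(?=';))", script_text)
print(grouped)


def last_assigned(text, name, index):
    """Return the last value assigned to name[index], or None if absent."""
    # No capturing groups, so findall returns a flat list of strings.
    # re.escape guards against regex metacharacters in the array name.
    pattern = re.compile(
        r"(?<=" + re.escape(name) + r"\[" + str(index) + r"\]=').*(?=';)"
    )
    matches = pattern.findall(text)  # flags belong in re.compile, not here
    return matches[-1] if matches else None


# active is one more than the required array element.
active = re.search(r"(?<=active=').*(?=';)", script_text)
index = int(active.group()) - 1

fields = [str(index)]
for name in ["Appname", "Appversion"]:
    fields.append(last_assigned(script_text, name, index))
print("\t".join(fields))
```

Taking matches[-1] picks the last assignment regardless of whether the element was assigned once, twice, or four times, and returning None for a missing element leaves it to the caller to decide how to log a page that lacks the array.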