Summary
I created a regex that grabs several different pieces of data from an HTML page. It uses grouped alternatives within a non-capturing group. It works really well for grabbing the needed data; however, the groups are not combined into as few matches as possible.
While coding it up, I thought the matches and groups looked a little odd in the online regex tester, but it wasn't until I got it working in Python that I noticed the issue with my group hierarchy.
My only solutions appear to be to...
- Rewrite the regex to have 5 matched groups per match
- Write python code to flatten the data structure.
- Something else???
The first option would be the best of these. I don't want to "pollute" my code base with unneeded code.
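For reference, the second option (flattening in Python) might look roughly like the sketch below. The helper name `merge_partial_rows` and the sample rows are hypothetical, and the sketch assumes that every complete record has all five fields non-empty; a record with a genuinely empty field would never be emitted.

```python
def merge_partial_rows(rows, width=5):
    """Fold consecutive partial tuples into complete records.

    Assumption: a record is complete once all `width` fields are non-empty.
    """
    merged = []
    current = [""] * width
    for row in rows:
        # Copy every non-empty field of this partial row into the record.
        for i, value in enumerate(row):
            if value:
                current[i] = value
        # Emit the record and start a new one once every field is filled.
        if all(current):
            merged.append(tuple(current))
            current = [""] * width
    return merged


# Hypothetical rows shaped like the findall() output in this question.
rows = [
    ("A1", "", "", "", ""),
    ("", "B1", "", "", ""),
    ("", "", "C1", "D1", "E1"),
    ("A2", "", "", "", ""),
    ("", "B2", "", "", ""),
    ("", "", "C2", "D2", "E2"),
]
merged = merge_partial_rows(rows)
print(merged)
```

This is about ten lines of extra code, which is the "pollution" I would like to avoid if the regex itself can do the job.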
Code
Regex
^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|\.zip))\">(.*)</a>.*\((.*) bytes\))
Playground https://regex101.com/r/MC8TOv/1/
Webscraped Site https://android-dot-devsite-v2-prod.appspot.com/studio/archive_25350a46834ddb86754aba2445ff1359aa7fd8cb296923255092494ac94ef531.frame
Python
While using BeautifulSoup
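import re
...
# Each alternative inside the non-capturing group carries its own capture group(s).
regex = r"^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|\.zip))\">(?P<filename>.*)</a>.*\((.*) bytes\))"
# With re.M, ^ matches at the start of every line, so findall() returns one tuple per matching line.
match = re.findall(regex, str(soup_html), re.M)
print(match)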
Expectations
What I am getting
This is a generic version of the output I am getting.
[
('A1', '', '', '', ''),
('', 'B1', '', '', ''),
('', '', 'C1', 'D1', 'E1'),
('A2', '', '', '', ''),
('', 'B2', '', '', ''),
('', '', 'C2', 'D2', 'E2'),
...
]
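The empty strings appear because `re.findall` returns one tuple per match with a slot for every capture group, and groups in alternatives that did not take part in the match come back as `''`. A minimal sketch of that behavior, using a simplified two-alternative pattern and made-up input (neither is taken from the scraped page):

```python
import re

# Hypothetical, simplified pattern: two alternatives, one capture group each.
pattern = re.compile(r"(?:<p>(.*)</p>|<span>(.*)</span>)")

# Hypothetical sample input.
sample = "<p>A1</p>\n<span>B1</span>"

# Each match fills only the group of the alternative that matched;
# the other group is reported as an empty string.
rows = pattern.findall(sample)
print(rows)  # [('A1', ''), ('', 'B1')]
```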
What I want
[
('A1', 'B1', 'C1', 'D1', 'E1'),
('A2', 'B2', 'C2', 'D2', 'E2'),
...
]
Again, is there a way to rewrite the regex to have 5 matched groups per match?
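One direction that might work is to replace the alternation with a sequence, so that a single match spans the `<p>`, the `<span>`, and the link, and to let `.` cross line boundaries with `re.S`. The sketch below is untested against the real page: the sample HTML is made up, and it assumes each record has a closing `</p>` and always appears in the order `<p>`, `<span>`, `<a>`, size.

```python
import re

# Hypothetical markup; the real page structure may differ.
sample = """<p class="t">A1</p>
<span>B1</span>
<a href="https://example.com/a-linux.tar.gz">D1</a> (111 bytes)
<p class="t">A2</p>
<span>B2</span>
<a href="https://example.com/b-linux.zip">D2</a> (222 bytes)
"""

# The parts are matched in sequence instead of as alternatives, so every
# match fills all five groups. Lazy quantifiers keep each part short, and
# re.S lets ".*?" skip over the newlines between the parts.
pattern = re.compile(
    r'<p[^>]*>(.*?)</p>.*?'
    r'<span>(.*?)</span>.*?'
    r'<a href="([^"]*linux(?:\.tar\.gz|\.zip))">(.*?)</a>.*?'
    r'\(([^)]*) bytes\)',
    re.S,
)

records = pattern.findall(sample)
print(records)
```

A caveat with this approach: if one record has no linux link, the lazy `.*?` will keep scanning into the next record and mix fields from both, so it only holds up if every record is complete.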