Flatten regex group matches

Question

Summary

I created a regex that grabs several different pieces of data from an HTML page. It uses grouped alternatives within a non-capturing group. It works really well to grab the needed data; however, the groups are not combined into as few matches as possible.

While coding it up, I thought the matches and groups looked a little weird in the online regex tester, but it wasn't until I got it working in Python that I noticed the issue with my group hierarchy.

My only solutions appear to be to:

1. Rewrite the regex to have 5 matched groups per match
2. Write Python code to flatten the data structure
3. Something else?

Number 1 above would be the best of these ideas. I don't want to "pollute" my code base with unneeded code.

Code

Regex

^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|.zip))\">(.*)</a>.*\((.*) bytes\))

Playground https://regex101.com/r/MC8TOv/1/

Webscraped Site https://android-dot-devsite-v2-prod.appspot.com/studio/archive_25350a46834ddb86754aba2445ff1359aa7fd8cb296923255092494ac94ef531.frame

Python

While using BeautifulSoup

import re

...

regex = r"^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|.zip))\">(?P<filename>.*)</a>.*\((.*) bytes\))"
match = re.findall(regex, str(soup_html), re.M)

print(match)

Expectations

What I am getting

This is generic output representative of what I am getting.

 [
     ('A1', '', '', '', ''),
     ('', 'B1', '', '', ''),
     ('', '', 'C1', 'D1', 'E1'),
     ('A2', '', '', '', ''),
     ('', 'B2', '', '', ''),
     ('', '', 'C2', 'D2', 'E2'),
     ...
 ]

What I want

 [
     ('A1', 'B1', 'C1', 'D1', 'E1'),
     ('A2', 'B2', 'C2', 'D2', 'E2'),
     ...
 ]

Again, is there a way to rewrite the regex to have 5 matched groups per match?

Why the downvote? How does this not show research, or how is it unclear? – Christopher Rucinski Oct 16 '19 at 05:24

score 0 · Answer 1 · answered Oct 12 '19 at 20:38

If the five things have to appear in a sequence as your example suggests, combine them with .*? rather than |:

(regex1).*?(regex2).*?(regex3).*(regex4).*?(regex5)

instead of

(regex1)|(regex2)|(regex3).*(regex4).*?(regex5)
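
As a rough illustration of the sequential form, here is a minimal Python sketch. The sample HTML, the closing </p> tag, and the names SAMPLE_HTML, PATTERN and rows are assumptions made for the example and are not taken from the scraped page. Because . does not match a newline by default, the sketch passes re.DOTALL so that the .*? separators can span lines.

```python
import re

# Hypothetical sample input; the real page's markup may differ.
SAMPLE_HTML = """\
<p class="title">A1</p>
<span>B1</span>
<a href="C1-linux.zip">D1</a> (E1 bytes)
<p class="title">A2</p>
<span>B2</span>
<a href="C2-linux.tar.gz">D2</a> (E2 bytes)
"""

# The five sub-patterns are joined with lazy ".*?" separators instead of "|",
# so a single match has to contain all five groups, in this order.
PATTERN = re.compile(
    r'<p[^>]*>(.*?)</p>.*?'
    r'<span>(.*?)</span>.*?'
    r'<a href="([^"]*linux(?:\.tar\.gz|\.zip))">(.*?)</a>.*?'
    r'\((.*?) bytes\)',
    re.DOTALL,  # let "." match newlines so one match can span several lines
)

# findall returns one 5-tuple per record.
rows = PATTERN.findall(SAMPLE_HTML)
print(rows)
```

This only suits regularly laid out data: if a record were missing one of the five parts, the lazy separators could run on into the next record.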

If, on the other hand, they don't necessarily have to appear in this order, then I don't quite see how you expect to shoehorn them into the rigid five-group structure.

Either way, there's also the possibility of post-processing the results after you've applied the regex.
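
The post-processing route could look something like the sketch below. The helper name flatten_matches and its width parameter are hypothetical, and the sketch assumes that the last group is only filled by the final alternative of each record, so a filled last group marks the end of a row.

```python
def flatten_matches(matches, width=5):
    """Merge consecutive sparse tuples from re.findall into full rows.

    Assumption: the last group is filled only by the final alternative
    of a record, so it signals that the current row is complete.
    """
    rows = []
    current = [''] * width
    for match in matches:
        # Copy every non-empty group into the row being built.
        for index, value in enumerate(match):
            if value:
                current[index] = value
        # A filled last group closes the row; start a fresh one.
        if match[-1]:
            rows.append(tuple(current))
            current = [''] * width
    return rows


# Generic data shaped like the output shown in the question.
matches = [
    ('A1', '', '', '', ''),
    ('', 'B1', '', '', ''),
    ('', '', 'C1', 'D1', 'E1'),
    ('A2', '', '', '', ''),
    ('', 'B2', '', '', ''),
    ('', '', 'C2', 'D2', 'E2'),
]
print(flatten_matches(matches))
```

A record that lacks one of the earlier parts would simply keep an empty string in that position.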

NPE

The data is laid out regularly, so technically something like this **could** work, but I need to rework it. My code is based on multiline mode and matching from the start of each line. This logic has to be removed – Christopher Rucinski Oct 12 '19 at 20:44
This appears to fail because of the newlines – Christopher Rucinski Oct 12 '19 at 21:08