
Summary

I created a regex that grabs several different pieces of data from an HTML page. It uses grouped alternatives within a non-capturing group. It works really well to grab the needed data; however, the groups are not combined into as few matches as possible.

While coding it up, I thought the matches and groups seemed a little weird in the online regex tester, but it wasn't until I got it working in Python that I noticed the issue with my group hierarchy.

My only solutions appear to be to...

  1. Rewrite the regex to have 5 matched groups per match
  2. Write Python code to flatten the data structure
  3. Something else???

Option 1 would be my preference, since I don't want to "pollute" my code base with unneeded code.

Code

Regex

^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|\.zip))\">(.*)</a>.*\((.*) bytes\))

Playground https://regex101.com/r/MC8TOv/1/

Webscraped Site https://android-dot-devsite-v2-prod.appspot.com/studio/archive_25350a46834ddb86754aba2445ff1359aa7fd8cb296923255092494ac94ef531.frame


Python

While using BeautifulSoup

import re

...

regex = r"^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|\.zip))\">(?P<filename>.*)</a>.*\((.*) bytes\))"
match = re.findall(regex, str(soup_html), re.M)

print(match)

Expectations

What I am getting

This is a generic version of the output I am getting.

 [
     ('A1', '', '', '', ''),
     ('', 'B1', '', '', ''),
     ('', '', 'C1', 'D1', 'E1'),
     ('A2', '', '', '', ''),
     ('', 'B2', '', '', ''),
     ('', '', 'C2', 'D2', 'E2'),
     ...
 ]

What I want

 [
     ('A1', 'B1', 'C1', 'D1', 'E1'),
     ('A2', 'B2', 'C2', 'D2', 'E2')
     ...
 ]

Again, is there a way to rewrite the regex to have 5 matched groups per match?

Christopher Rucinski

1 Answer


If the five things have to appear in a sequence as your example suggests, combine them with .*? rather than |:

(regex1).*?(regex2).*?(regex3).*(regex4).*?(regex5)

instead of

(regex1)|(regex2)|(regex3).*(regex4).*?(regex5)
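As a rough sketch of the sequential form, here is how it might look in Python. The HTML fragment, the example.com URLs, and the tightened sub-patterns (`[^>]*`, `[^"]*`, `[^)]*`) are my own assumptions for illustration; the real page may be laid out differently. Since the pieces sit on separate lines, `re.S` is used so that `.` can cross newlines.

```python
import re

# Hypothetical fragment for illustration only; the real page may differ.
html = """<p class="expand-control">Android Studio 3.0</p>
<span>October 25, 2017</span>
<a href="https://example.com/studio-3.0-linux.zip">studio-3.0-linux.zip</a> (100 bytes)
<p class="expand-control">Android Studio 3.1</p>
<span>March 26, 2018</span>
<a href="https://example.com/studio-3.1-linux.tar.gz">studio-3.1-linux.tar.gz</a> (200 bytes)
"""

# The five groups are chained with lazy .*? instead of being alternatives,
# so every match carries all five values.
pattern = re.compile(
    r'<p[^>]*>(.*?)</p>.*?'                                    # group 1: title
    r'<span>(.*?)</span>.*?'                                   # group 2: date
    r'<a href="([^"]*linux(?:\.tar\.gz|\.zip))">(.*?)</a>.*?'  # groups 3, 4: URL, link text
    r'\(([^)]*) bytes\)',                                      # group 5: size
    re.S,  # let . match newlines, since the pieces are on separate lines
)

rows = pattern.findall(html)
for row in rows:
    print(row)
```

One caveat with this approach: if a record is missing one of the pieces (say, no linux link), the lazy `.*?` will happily run on into the next record to find it, so the groups of one match could come from two different records.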

If, on the other hand, they don't necessarily have to appear in this order, then I don't quite see how you expect to shoehorn them into the rigid five-group structure.

Either way, there's also the possibility of post-processing the results after you've applied the regex.
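If you do go the post-processing route, it doesn't have to be much code. A minimal sketch, assuming the sparse tuples arrive in document order and that a record is complete once its last group is filled (`merge_rows` is just a name I made up for this example):

```python
def merge_rows(matches, width=5):
    """Fold sparse findall() tuples into one full tuple per record.

    Assumes the last group (index width - 1) is always the final piece
    of a record, as in the output shown in the question.
    """
    rows = []
    current = [''] * width
    for match in matches:
        # Copy every non-empty group into its slot of the current record.
        for i, value in enumerate(match):
            if value:
                current[i] = value
        # The last slot being filled marks the end of the record.
        if current[-1]:
            rows.append(tuple(current))
            current = [''] * width
    return rows


# The generic output from the question
matches = [
    ('A1', '', '', '', ''),
    ('', 'B1', '', '', ''),
    ('', '', 'C1', 'D1', 'E1'),
    ('A2', '', '', '', ''),
    ('', 'B2', '', '', ''),
    ('', '', 'C2', 'D2', 'E2'),
]

print(merge_rows(matches))
```

Unlike the single combined regex, this keeps working if a record is missing one of the earlier pieces; that slot just stays an empty string.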

NPE