Python regular expression match start and end string and must contain specific word

Question

I need some guidance refining my regex. I have the source of a webpage and would like to extract the hrefs from it. The table doesn't have any IDs or classes. I decided to use regex, but my expression seems to match more than I want. I have tried the following:

http:\/\/(.*?)(?=.*showuri)(.*?)responseType=xml\">\/lnc\/

My start is http://, the end is responseType=xml">/lnc/, and the middle bit must contain the word showuri.

I am using Python 3
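
For illustration, here is a minimal sketch of one way the pattern could be tightened. The lookahead `(?=.*showuri)` succeeds if showuri appears anywhere later on the same line, so it does not tie the word to the link being matched. Replacing `.*?` with `[^"]*` keeps each match inside a single href. The `page` string is a made-up fragment, and the sketch assumes every href is wrapped in double quotes.

```python
import re

# Hypothetical page fragment, invented for this example
page = (
    '<a href="http://host/a?cmd=other&responseType=xml">/lnc/one</a>'
    '<a href="http://host/b?cmd=showuri&responseType=xml">/lnc/two</a>'
)

# [^"]* cannot cross the closing quote of an href, so showuri must sit
# inside the same link; the lookahead checks the ">/lnc/ tail without
# including it in the match
pattern = re.compile(r'http://[^"]*showuri[^"]*responseType=xml(?=">/lnc/)')

matches = pattern.findall(page)
print(matches)  # ['http://host/b?cmd=showuri&responseType=xml']
```

Only the second link is returned, because the first href does not contain showuri between its own quotes.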

Maybe you should use a parser first to get all hrefs (see [this post](https://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-using-python)) then filter the results on contains `responseType=xml>/lnc/` — ctwheels, Nov 28 '19 at 04:22

score 0 · Answer 1 · answered Nov 28 '19 at 05:01

The method I used for this is as follows:

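from lxml import html  # assuming html here is lxml.html
import pandas as pd

# text is the page source as a string
doc = html.fromstring(text)
tr_elements = doc.xpath('//a/@href')  # every href attribute on the page
df = pd.DataFrame(tr_elements)
df.columns = ['URL']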

From this point, I will drop rows that do not contain "showuri".
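
A rough sketch of that filtering step, written against a plain list of hrefs so it runs without the page itself. The helper name `keep_showuri` and the sample URLs are hypothetical. The check for the `responseType=xml` ending assumes that `">/lnc/` in the question marks the end of the href attribute followed by the link text.

```python
def keep_showuri(hrefs):
    """Return only the hrefs that meet the three conditions from the question."""
    return [
        h for h in hrefs
        if h.startswith("http://")           # required start
        and "showuri" in h                   # required word in the middle
        and h.endswith("responseType=xml")   # assumed end of the href value
    ]

# Hypothetical values standing in for doc.xpath('//a/@href')
sample = [
    "http://example.com/api?cmd=showuri&id=1&responseType=xml",
    "http://example.com/api?cmd=other&id=2&responseType=xml",
    "/relative/showuri?responseType=xml",
]

filtered = keep_showuri(sample)
print(filtered)  # only the first sample URL is kept
```

On the DataFrame itself, the showuri condition alone could be expressed as `df[df['URL'].str.contains('showuri', regex=False)]`.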

qbbq

thanks to @ctwheels for a similar method of approaching this – qbbq Nov 28 '19 at 05:02
