I need some guidance refining my regex. I have the source of a web page and would like to extract the hrefs from it. The table doesn't have any IDs or classes. I decided to use a regex, but my expression seems to match more than I want. I have tried the following:

http:\/\/(.*?)(?=.*showuri)(.*?)responseType=xml\">\/lnc\/

The match should start at http:// and end at responseType=xml">/lnc/, and the part in between must contain the word showuri.

I am using Python 3
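
To illustrate the over-matching: the lookahead `(?=.*showuri)` only asserts that showuri appears somewhere later in the text, not that it appears before the end marker, so a link without showuri still matches whenever a later link contains it. Below is a minimal sketch of a tighter pattern, assuming each URL sits inside a double-quoted href; the sample markup and the example.com URLs are made up for illustration.

```python
import re

# Hypothetical sample of the page source; the real markup may differ.
sample = (
    '<a href="http://example.com/a?showuri=1&responseType=xml">/lnc/one</a>'
    '<a href="http://example.com/b?other=1&responseType=xml">/lnc/two</a>'
    '<a href="http://example.com/c?showuri=2&responseType=xml">/lnc/three</a>'
)

# The expression from the question: the lookahead can be satisfied by a
# "showuri" that belongs to a later link.
original = re.compile(r'http:\/\/(.*?)(?=.*showuri)(.*?)responseType=xml\">\/lnc\/')
original_count = len(list(original.finditer(sample)))

# [^"]* cannot cross the closing quote of an href, so "showuri" has to
# occur inside the same URL that ends in responseType=xml.
tighter = re.compile(r'http://[^"]*showuri[^"]*responseType=xml(?=">/lnc/)')
matches = tighter.findall(sample)

print(original_count)
print(matches)
```

On this sample the original expression also matches the link without showuri, while the tighter pattern returns only the two URLs that contain it.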

qbbq
  • Maybe you should use a parser first to get all hrefs (see [this post](https://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-using-python)) then filter the results on contains `responseType=xml>/lnc/` – ctwheels Nov 28 '19 at 04:22
  • Don’t use RegEx for this. – AMC Nov 28 '19 at 04:48

1 Answer

The method I used for this is as follows:

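from lxml import html
import pandas as pd

# Parse the page source (text) and collect the href attribute of every <a> element
doc = html.fromstring(text)
hrefs = doc.xpath('//a/@href')
df = pd.DataFrame(hrefs)
df.columns = ['URL']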

From this point, I will drop the rows that do not contain "showuri".
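
A minimal sketch of that filtering step, shown on a plain list so it runs without lxml or pandas. The URLs are hypothetical stand-ins for the result of the `//a/@href` XPath query.

```python
# Hypothetical hrefs, standing in for the result of doc.xpath('//a/@href').
hrefs = [
    'http://example.com/a?showuri=1&responseType=xml',
    'http://example.com/about',
    'http://example.com/c?showuri=2&responseType=xml',
]

# Keep only the links whose URL contains the substring "showuri".
showuri_links = [url for url in hrefs if 'showuri' in url]

# With the DataFrame built above, the equivalent filter would be:
# df = df[df['URL'].str.contains('showuri', regex=False)]

print(showuri_links)
```

Passing `regex=False` makes pandas treat "showuri" as a literal substring rather than a pattern.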

qbbq