I'm practicing using BS4 to parse HTML files. I've run into an issue and can't find a solution anywhere. How would I parse the contents of an anchor tag? I've tried reading the "href" attribute, but the link has some added characters that break the attribute value.

For instance, I am trying to parse this link to one of my older questions:

<a href = "https://stackoverflow.com/questions/61925957/using-an-api-to-create-data-in-a-react-table" style=
=3D"color: #FFFFFF;font-size: 15px;"> >

Instead, it contains extra characters that break the tag:

<a href = "https://stackoverflow.com/&amp=3D"questions/61925957"=3D"/using-an-api-to-create-data-in-a-react-table" style=
=3D"color: #FFFFFF;font-size: 15px;" >

How would I get the contents of this tag using bs4 so that I can trim it and get my final link? I also want to ignore the style, color, and font-size descriptors.

  • 1
    Please update your question with your attempt in the form of a [mre]. I can't reproduce your issue. – baduker Mar 07 '23 at 08:05

1 Answer

I can't reproduce the issue; this works just fine:

from bs4 import BeautifulSoup

html_sample = """<a href = "https://stackoverflow.com/questions/61925957/using-an-api-to-create-data-in-a-react-table" style=
=3D"color: #FFFFFF;font-size: 15px;"> >"""

# Parse the snippet, take the first <a> tag, and read only its href attribute.
href = BeautifulSoup(html_sample, "lxml").select_one("a")["href"]
print(href)

Output:

https://stackoverflow.com/questions/61925957/using-an-api-to-create-data-in-a-react-table
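
If the file really does contain sequences like `=3D` and lines that end in a bare `=`, one possible explanation is that it is quoted-printable encoded, as HTML saved from an email often is. That is an assumption, since the question doesn't say where the HTML came from. Under that assumption, a minimal sketch would decode the text with the standard-library `quopri` module before handing it to bs4. The `raw` input below is a hypothetical sample modelled on the question's snippet, and the `link` text inside the tag is a placeholder.

```python
import quopri

from bs4 import BeautifulSoup

# Hypothetical quoted-printable input modelled on the question's snippet:
# "=3D" encodes "=", and a trailing "=" at end of line is a soft line break.
raw = (
    b'<a href=3D"https://stackoverflow.com/questions/61925957/'
    b'using-an-api-to-create-data-in-a-react-table" style=\n'
    b'=3D"color: #FFFFFF;font-size: 15px;">link</a>'
)

# Undo the quoted-printable encoding first, then parse the clean HTML.
decoded = quopri.decodestring(raw).decode("utf-8")
anchor = BeautifulSoup(decoded, "html.parser").find("a")

# Reading only "href" ignores the style attribute entirely.
href = anchor["href"]
print(href)
```

Whether this applies depends on the actual file. If the input is not quoted-printable, the decoding step is unnecessary and reading `["href"]` directly, as above, is enough.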
baduker