From the following HTML code I need to extract the strings inside the code tag and anchor tag.
<a class="reference internal" href="optparse.html">
<code class="xref py py-mod docutils literal notranslate">
<span class="pre">
optparse
</span>
</code>
— Parser for command line options
</a>
I am using this script:
from bs4 import BeautifulSoup
with open("index.html", encoding="utf-8") as fp:
    bs = BeautifulSoup(fp, 'html.parser')
i = 0
for link in bs.find_all('a'):
    if "#" not in link.attrs["href"]:
        if link.find("code"):
            print(link.text, link.find("code").text)
With the python script above I get:
optparse
— Parser for command line options
optparse
The problem is that I need to get "optparse" and "— Parser for command line options" separately. The link.text property from Beautiful Soup returns all the text inside the anchor tag, including the text inside the nested code tag.
How can I get both strings separately?
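One possible approach, sketched here under the assumption that the description is always a direct text child of the anchor and the name always sits inside the code tag: read the code tag's text on its own, then collect only the anchor's top-level strings with find_all(string=True, recursive=False), which skips anything nested in child tags. The helper name split_link_text is hypothetical, and the markup is inlined from the question so the snippet runs without index.html.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; in practice it would be read from index.html.
html = """<a class="reference internal" href="optparse.html">
<code class="xref py py-mod docutils literal notranslate">
<span class="pre">
optparse
</span>
</code>
— Parser for command line options
</a>"""


def split_link_text(link):
    """Hypothetical helper: return (code_text, remaining_text) for an <a> tag."""
    code = link.find("code")
    code_text = code.get_text(strip=True) if code else ""
    # recursive=False restricts the search to direct children of the anchor,
    # so strings nested inside <code> or <span> are not included.
    rest = "".join(link.find_all(string=True, recursive=False)).strip()
    return code_text, rest


soup = BeautifulSoup(html, "html.parser")
# href=True skips anchors that have no href attribute at all.
for link in soup.find_all("a", href=True):
    if "#" not in link["href"] and link.find("code"):
        name, description = split_link_text(link)
        print(name)
        print(description)
```

For the sample markup this prints optparse on one line and the description on the next. Because it does not depend on a fixed child index, it should keep working if extra tags are added inside the anchor, although any additional top-level text would be joined into the description.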
` or another element is added, then the indices shift and we'll crash or get the wrong output. The safer way to get `optparse` is shown in [this answer](https://stackoverflow.com/a/31909680/6243352), then use a regular selector instead of `contents[1]`. Happy to remove the downvote if you use that instead of the two approaches shown here.
– ggorlen Jul 20 '21 at 23:06