Extract text from inline children of an anchor tag

Question

From the following HTML code I need to extract the strings inside the code tag and anchor tag.

 <a class="reference internal" href="optparse.html">
           <code class="xref py py-mod docutils literal notranslate">
            <span class="pre">
             optparse
            </span>
           </code>
           — Parser for command line options
          </a>

I am using this script:

from bs4 import BeautifulSoup

with open("index.html", encoding="utf-8") as fp:
    bs = BeautifulSoup(fp, "html.parser")

for link in bs.find_all("a"):
    if "#" not in link.attrs["href"]:
        if link.find("code"):
            print(link.text, link.find("code").text)

With the Python script above I get:

     optparse


   — Parser for command line options


     optparse

The problem is that I need to get "optparse" and "— Parser for command line options" separately. The `link.text` property from Beautiful Soup returns all the text inside the anchor tag, including the text inside the code tag.

How can I get both strings separately?

MendelG · Answer 1 · 2021-07-20T22:53:49.873

You can use `.contents` to get the specific text values:

from bs4 import BeautifulSoup

html = """ 
<a class="reference internal" href="optparse.html">
 <code class="xref py py-mod docutils literal notranslate">
  <span class="pre">
   optparse
  </span>
 </code>
 — Parser for command line options
</a>
"""
soup = BeautifulSoup(html, "html.parser")

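for tag in soup.find_all("a", class_="reference internal"):
    # contents[0] is the whitespace before <code>, contents[1] is the <code> tag,
    # and contents[2] is the text node that follows it
    name, description = tag.contents[1].text.strip(), tag.contents[2].strip()
    print(name)
    print(description)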

Or (this will break if your text contains `|`, as pointed out in the comments):

for tag in soup.find_all("a", class_="reference internal"):
    optparse, parser = tag.get_text(separator="|", strip=True).split("|")
    print(optparse)
    print(parser)

Output (both examples):

optparse
— Parser for command line options
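
If a collision with the separator is a concern, one possible variant is to iterate over `stripped_strings`, which yields each non-empty text node with the surrounding whitespace removed, so no separator character is needed. This is a minimal sketch that assumes the anchor holds exactly two text fragments, as in the HTML above; the names `name`, `description` and `results` are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<a class="reference internal" href="optparse.html">
 <code class="xref py py-mod docutils literal notranslate">
  <span class="pre">
   optparse
  </span>
 </code>
 — Parser for command line options
</a>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for tag in soup.find_all("a", class_="reference internal"):
    # stripped_strings skips whitespace-only text nodes and strips the rest
    parts = list(tag.stripped_strings)
    if len(parts) == 2:  # assumption: one name and one description per link
        name, description = parts
        results.append((name, description))
        print(name)
        print(description)
```

Like the `.contents` version, this still depends on the order and number of text fragments, so links with a different layout are skipped rather than unpacked.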

edited Jul 20 '21 at 22:53

answered Jul 20 '21 at 22:36

Sure -- this is brittle. It will break if `|` happens to be in the text. There are better ways to do this. For example, [retrieve the text contents from the parent tag non-recursively](https://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children), then dip into children and retrieve their text contents. – ggorlen Jul 20 '21 at 22:41

@ggorlen Good point, thank you. I'll leave it at this since the _provided_ HTML doesn't contain `|` – MendelG Jul 20 '21 at 22:45

After the update, the `contents` approach is better, but it still seems brittle. For example, if the whitespace between `<a>` and `<code>` disappears, or another element is added, then the indices shift and we'll crash or get the wrong output. The safer way to get `optparse` is shown in [this answer](https://stackoverflow.com/a/31909680/6243352), then use a regular selector instead of `contents[1]`. Happy to remove the downvote if you use that instead of the two approaches shown here. – ggorlen Jul 20 '21 at 23:06

ggorlen · Accepted Answer · 2021-07-20T22:59:20.110

You can grab the shallow text using `tag.find_all(text=True, recursive=False)` as described here, then use a normal selector to pull out the deeper text from the `<span>`.

This way, all of the data is separate from the start and you're not dealing with parsing the individual pieces from the smushed text in the parent's view after the fact.

from bs4 import BeautifulSoup

html = """ 
<a class="reference internal" href="optparse.html">
 <code class="xref py py-mod docutils literal notranslate">
  <span class="pre">
   optparse
  </span>
 </code>
 — Parser for command line options
</a>
"""
soup = BeautifulSoup(html, "html.parser")

for tag in soup.select("a.reference.internal"):
    shallow = "".join(tag.find_all(text=True, recursive=False)).strip()
    deep = tag.find("code").text.strip()
    print(repr(shallow)) # => '— Parser for command line options'
    print(repr(deep))    # => 'optparse'
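
The same idea can be wrapped in a small helper. The sketch below is only an illustration: the function name `split_link_text` is hypothetical, it uses `string=True` (the name that Beautiful Soup 4.4.0 and later give to the `text` argument), and it assumes that a link without a `<code>` child should yield `None` for the deep part. The second link in the sample HTML is made up to exercise that case.

```python
from bs4 import BeautifulSoup


def split_link_text(tag):
    """Return (deep, shallow) text for a link tag (hypothetical helper)."""
    # Only text nodes that are direct children of the tag
    shallow = "".join(tag.find_all(string=True, recursive=False)).strip()
    # Text nested inside <code>, or None when the link has no <code> child
    code = tag.find("code")
    deep = code.get_text(strip=True) if code else None
    return deep, shallow


html = """
<a class="reference internal" href="optparse.html">
 <code class="xref py py-mod docutils literal notranslate">
  <span class="pre">
   optparse
  </span>
 </code>
 — Parser for command line options
</a>
<a class="reference internal" href="plain.html">Plain link</a>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = [split_link_text(a) for a in soup.select("a.reference.internal")]
for deep, shallow in pairs:
    print(repr(deep), repr(shallow))
```

Because the shallow and deep parts are read from separate places in the tree, adding or removing whitespace between the tags does not change which value lands in which variable.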
