
From the following HTML code I need to extract the strings inside the code tag and anchor tag.

 <a class="reference internal" href="optparse.html">
           <code class="xref py py-mod docutils literal notranslate">
            <span class="pre">
             optparse
            </span>
           </code>
           — Parser for command line options
          </a>

I am using this script:

from bs4 import BeautifulSoup

with open("index.html", encoding="utf-8") as fp:
    bs = BeautifulSoup(fp, 'html.parser')

for link in bs.find_all('a'):
    if "#" not in link.attrs["href"]:
        if link.find("code"):
            print(link.text, link.find("code").text)

With the python script above I get:

     optparse


   — Parser for command line options


     optparse

The problem is that I need to get "optparse" and "— Parser for command line options" separately. The link.text property in Beautiful Soup returns all the text inside the anchor tag, including the text inside the code tag.
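
To illustrate the behaviour described above, here is a minimal sketch. It assumes a condensed, single-line version of the snippet parsed from an inline string rather than from index.html. It suggests that .text is simply every descendant string joined together, which is why the code text shows up in it:

```python
from bs4 import BeautifulSoup

# Condensed version of the anchor from the question (an assumption for brevity)
html = '<a href="optparse.html"><code><span>optparse</span></code> — Parser for command line options</a>'
link = BeautifulSoup(html, "html.parser").a

# .strings walks every text node below the tag, however deeply nested
all_strings = list(link.strings)
combined = link.text
code_only = link.find("code").text

print(all_strings)
print(combined == "".join(all_strings))
print(code_only)
```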

How can I get both strings separately?

Carlitos_30

2 Answers


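You can use .contents to pick out the individual child nodes of the anchor. With this exact markup, index 0 is the whitespace before the code tag, index 1 is the code tag itself, and index 2 is the trailing text: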

from bs4 import BeautifulSoup

html = """ 
<a class="reference internal" href="optparse.html">
 <code class="xref py py-mod docutils literal notranslate">
  <span class="pre">
   optparse
  </span>
 </code>
 — Parser for command line options
</a>
"""
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("a", class_="reference internal"):
    # contents[1] is the <code> tag, contents[2] is the text that follows it
    name, description = tag.contents[1].text.strip(), tag.contents[2].strip()
    print(name)
    print(description)

Or (this will break if your text contains | as pointed out in the comments):

for tag in soup.find_all("a", class_="reference internal"):
    optparse, parser = tag.get_text(separator="|", strip=True).split("|")
    print(optparse)
    print(parser)

Output (both examples):

optparse
— Parser for command line options
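
A possible variation on the separator approach, sketched here as an alternative rather than as part of the original answer, is Beautiful Soup's stripped_strings generator. It yields each non-empty text node with surrounding whitespace removed, so no separator character has to be chosen. The results list is only there for illustration:

```python
from bs4 import BeautifulSoup

html = """
<a class="reference internal" href="optparse.html">
 <code class="xref py py-mod docutils literal notranslate">
  <span class="pre">
   optparse
  </span>
 </code>
 — Parser for command line options
</a>
"""
soup = BeautifulSoup(html, "html.parser")

results = []  # hypothetical collector, one list of strings per link
for tag in soup.find_all("a", class_="reference internal"):
    # Whitespace-only text nodes are skipped, the rest are stripped
    parts = list(tag.stripped_strings)
    results.append(parts)
    print(parts)
```

Note that a description containing further inline markup would produce more than two parts, so unpacking into exactly two names is only safe for markup shaped like the sample.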
MendelG

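You can grab the shallow text (the strings that are direct children of the anchor) using tag.find_all(text=True, recursive=False), then use a normal selector to pull out the deeper text from the code tag. In Beautiful Soup 4.4.0 and later, the text argument is also available under the name string.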

This way, the two pieces are separate from the start, and you don't have to split them back out of the parent's combined text afterwards.

from bs4 import BeautifulSoup

html = """ 
<a class="reference internal" href="optparse.html">
 <code class="xref py py-mod docutils literal notranslate">
  <span class="pre">
   optparse
  </span>
 </code>
 — Parser for command line options
</a>
"""
soup = BeautifulSoup(html, "html.parser")

for tag in soup.select("a.reference.internal"):
    shallow = "".join(tag.find_all(text=True, recursive=False)).strip()
    deep = tag.find("code").text.strip()
    print(repr(shallow)) # => '— Parser for command line options'
    print(repr(deep))    # => 'optparse'
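
The same idea could be wrapped in a small helper that also copes with links that have no code tag. This is only a sketch: split_link, the second sample link, and the pairs list are hypothetical names and data for illustration, and it uses the string= spelling, which assumes Beautiful Soup 4.4.0 or later:

```python
from bs4 import BeautifulSoup

def split_link(tag):
    """Return (code_text, own_text); code_text is None when there is no <code>."""
    code = tag.find("code")
    deep = code.get_text(strip=True) if code else None
    # string=True matches text nodes; recursive=False keeps only direct children
    shallow = "".join(tag.find_all(string=True, recursive=False)).strip()
    return deep, shallow

# Sample markup: the second link is made up to show the no-<code> case
html = """
<a href="optparse.html"><code><span class="pre">optparse</span></code> — Parser for command line options</a>
<a href="about.html">About these documents</a>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = [split_link(a) for a in soup.find_all("a")]
for deep, shallow in pairs:
    print(repr(deep), repr(shallow))
```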
ggorlen