Beautiful Soup: extracting tagged and untagged HTML text

Question

As a novice with bs4 I'm looking for some help in working out how to extract the text from a series of webpage tables, one of which is like this:

<table style="padding:0px; margin:1px" width="715px">
<tr>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Family: </strong></span>
Tytonidae
</td>
<td height="22" width="66%" colspan="2">
<span class="darkGreenText"><strong> Status: </strong></span>
Least Concern
</td>
</tr>
</table>

Desired output:

Name: Tyto alba
Order: Strigiformes
Family: Tytonidae
Status: Least Concern

I've tried using [index] as recommended (https://stackoverflow.com/a/35050622/1726290), and also next_sibling (https://stackoverflow.com/a/23380225/1726290) but I'm getting stuck as one part of the text I need is tagged and the second part is not. Any help would be appreciated.

Dear Mr Cunningham, I'm sorry that you felt it necessary to post a comment which shows so little empathy for a novice in this field. What may be blindingly obvious to you may not be quite so clear to the questioner. – MichaelMaggs, Sep 28 '16 at 20:58
It would be blindingly obvious to you too if you RTFM; this is something anyone would know after five minutes of reading the excellent, comprehensive and easy to comprehend bs4 docs. There is a difference between being a novice and not making a basic effort. – Padraic Cunningham, Sep 28 '16 at 21:03

score 2 · Accepted Answer · answered Sep 28 '16 at 19:05

It seems like what you want is to call get_text(strip=True) (see the bs4 docs) on each BeautifulSoup Tag. Assuming raw_html is the HTML you pasted above:

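from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
htmlSoup = BeautifulSoup(raw_html, 'html.parser')
for tag in htmlSoup.select('td'):
    # strip=True trims whitespace from each text fragment in the cell
    print(tag.get_text(strip=True))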

which prints:

Name:Tyto alba
Order:Strigiformes
Family:Tytonidae
Status:Least Concern
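Note that the printed text has no space after the colon, whereas the desired output in the question does. One possible way to get the space, as a minimal sketch: get_text() also accepts a separator string as its first argument, which is placed between the stripped text fragments. The helper name cell_texts and the shortened two-cell raw_html below are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two cells only, for illustration).
raw_html = """
<table>
<tr>
<td><span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td><span class="darkGreenText"><strong> Status: </strong></span>
Least Concern
</td>
</tr>
</table>
"""

def cell_texts(html):
    """Return one 'Label: value' string per <td> cell (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # The first argument is the separator joined between the stripped
    # text fragments of each cell: "Name:" + " " + "Tyto alba".
    return [td.get_text(" ", strip=True) for td in soup.find_all("td")]

for line in cell_texts(raw_html):
    print(line)
# prints:
# Name: Tyto alba
# Status: Least Concern
```

The tagged label and the untagged value are just two text fragments inside the same td, so joining them with a single space should reproduce the "Name: Tyto alba" format without needing [index] or next_sibling.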

answered Sep 28 '16 at 19:05

Thang

Thanks very much, that's exactly what I was looking for. I'll follow up with some more reading on `get_text(strip=True)`. – MichaelMaggs Sep 28 '16 at 21:03
