Is there a way to do a depth-first traversal (DFT) of a BeautifulSoup parse tree? I want to start at the root (usually `<html>`), get all of its child elements, then get the children of each child, and so on until I hit a terminal node, at which point I'll build my way back up the tree. The problem is that I can't find a method that lets me do this. I found the findChildren method, but it seems to put the entire page in a list multiple times, with each subsequent entry a smaller piece of the one before. I might be able to use this for a traversal, but apart from the last entry in the list there doesn't appear to be any way to tell whether an entry is a terminal node. Any ideas?
16
-
I am not aware of a solution with BeautifulSoup, but I have a solution with the lxml library that works. Is using BeautifulSoup a compulsion? If not, I can suggest a solution with lxml. – user225312 Jan 27 '11 at 09:10
2 Answers
19
`mytag.find_all()` already does that:
If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on
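from bs4 import BeautifulSoup # pip install beautifulsoup4

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup("""<!doctype html>
<div id=a>A
<div id=1>A1</div>
<div id=2>A2</div>
</div>
<div id=b>B
<div id=I>BI</div>
<div id=II>BII</div>
</div>
""", "html.parser")

# recursive=True (the default) visits all descendants, not just direct children
for div in soup.find_all("div", recursive=True):
    print(div.get('id'))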
Output
a
1
2
b
I
II
The output confirms that it is a depth-first traversal: each div's nested divs are printed before its next sibling.
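To visit every tag rather than only the divs, `find_all(True)` can be used, since `True` matches any tag name. The sketch below is only an illustration: the helper `tags_with_depth` and the shortened markup are made up for this example. It records how deep each tag sits so the nesting is visible when printed.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, assumed for illustration only
html = """<div id=a>A
<div id=1>A1</div>
<div id=2>A2</div>
</div>
<div id=b>B
<div id=I>BI</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

def tags_with_depth(root):
    """Return (depth, id) pairs for every tag under root, in find_all order."""
    result = []
    for tag in root.find_all(True):  # True matches every tag
        depth = 0
        # Count the ancestors between this tag and root
        for parent in tag.parents:
            if parent is root:
                break
            depth += 1
        result.append((depth, tag.get("id")))
    return result

visited = tags_with_depth(soup)
for depth, tag_id in visited:
    print("  " * depth + str(tag_id))
```

Printing with indentation makes it easy to check that a tag's descendants are listed before its next sibling.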
Old Beautiful Soup 3 answer:
`recursiveChildGenerator()` already does that:
import BeautifulSoup  # Beautiful Soup 3 (Python 2 only)

# `html` holds the sample markup from @msalvadores's answer
soup = BeautifulSoup.BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
    name = getattr(child, "name", None)
    if name is not None:  # a tag
        print name
    elif not child.isspace():  # leaf node, don't print spaces
        print child
Output
For the html from @msalvadores's answer:
html
ul
li
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
li
Aliquam tincidunt mauris eu risus.
li
Vestibulum auctor dapibus neque.
html
NOTE: `html` is printed twice because the example contains two opening `<html>` tags.
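For Beautiful Soup 4 on Python 3, a rough equivalent of the loop above can be written with the `.descendants` property. This is a sketch, not the original answer's code: the helper name `walk` and the shortened sample markup are assumptions for illustration. Tags are told apart from text leaves with `isinstance` checks.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened version of the sample markup, assumed for illustration
html = """<html><ul>
<li>Lorem ipsum</li>
<li>Aliquam tincidunt</li>
</ul></html>"""

soup = BeautifulSoup(html, "html.parser")

def walk(root):
    """Collect tag names and non-blank text leaves in iteration order."""
    items = []
    for node in root.descendants:
        if isinstance(node, Tag):
            items.append(node.name)       # container node
        elif isinstance(node, NavigableString) and not node.isspace():
            items.append(str(node))       # terminal (text) node
    return items

visited = walk(soup)
for item in visited:
    print(item)
```

With this input, each `li` is followed directly by its own text, which matches the order shown in the Beautiful Soup 3 output.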

jfs
-
@Sebastian I fixed the html example in case you want to edit your answer. Good to know that recursiveChildGenerator exists (+1 to that). – Manuel Salvadores Jan 27 '11 at 11:29
-
Thanks - this worked for me! For those who might be looking for documentation on this, in BeautifulSoup 4 `recursiveChildGenerator()` was renamed `descendants()`. – J. Taylor Jan 25 '19 at 07:59
-
@J.Taylor And the documentation on `descendants` (a property, so no parentheses) states that it performs a breadth-first traversal. I'm not sure if this answer is still valid for current bs4 versions? – Charles Langlois Oct 20 '20 at 20:21
-
@CharlesLanglois: to get DFT, `find_all(recursive=True)` can be used in bs4. I've updated the answer. – jfs Oct 21 '20 at 16:32
-
Simply use `soup.find_all(recursive=True)` instead of `soup.find_all("div", recursive=True)` to get all children in the general case. – Bálint Sass Jan 12 '23 at 13:50
6
I think you can use the method "childGenerator" and call it recursively to walk the tree depth-first.
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2 only)

def recursiveChildren(x):
    if "childGenerator" in dir(x):
        # x is a container: report each child tag, then descend into every child
        for child in x.childGenerator():
            name = getattr(child, "name", None)
            if name is not None:
                print "[Container Node]",child.name
            recursiveChildren(child)
    else:
        if not x.isspace(): #Just to avoid printing "\n" parsed from document.
            print "[Terminal Node]",x

if __name__ == "__main__":
    soup = BeautifulSoup(your_data)
    for child in soup.childGenerator():
        recursiveChildren(child)
With `"childGenerator" in dir(x)` we make sure that an element is a container; terminal nodes such as NavigableStrings are not containers and do not contain children.
For some example HTML like:
<html>
<ul>
<li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
<li>Aliquam tincidunt mauris eu risus.</li>
<li>Vestibulum auctor dapibus neque.</li>
</ul>
</html>
This script prints:
[Container Node] ul
[Container Node] li
[Terminal Node] Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
[Container Node] li
[Terminal Node] Aliquam tincidunt mauris eu risus.
[Container Node] li
[Terminal Node] Vestibulum auctor dapibus neque.
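The same recursive idea can be sketched for Beautiful Soup 4 on Python 3. This is an illustration under stated assumptions, not a port of the code above: the generator name `walk`, the event tuples, and the shortened markup are all invented for the example. It also carries the chain of parent tag names along, and reports a container only after its children, which corresponds to "building back up" the tree.

```python
from bs4 import BeautifulSoup, Tag

def walk(node, path=()):
    """Yield (kind, path, text) events depth-first.

    A container is reported after all of its children (post-order).
    """
    for child in node.children:
        if isinstance(child, Tag):
            child_path = path + (child.name,)
            # Descend first, then report the container itself
            yield from walk(child, child_path)
            yield ("container", child_path, None)
        elif not child.isspace():
            # Text node: a terminal node, reported with its parent path
            yield ("terminal", path, str(child))

# Shortened sample markup, assumed for illustration
soup = BeautifulSoup("<html><ul><li>one</li><li>two</li></ul></html>",
                     "html.parser")
events = list(walk(soup))
for kind, path, text in events:
    print(kind, "/".join(path), text or "")
```

Passing the path down as an immutable tuple keeps each recursion level independent, so no cleanup is needed when the traversal returns to a parent.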

Manuel Salvadores
-
There are a number of problems in your code: 1. BeautifulSoup already does DFT http://stackoverflow.com/questions/4814317/depth-first-traversal-on-beautifulsoup-parse-tree/4815399#4815399 2. don't use `dir()` or `.__dict__`; use `hasattr()` instead. 3. `x` already has the required string methods; you don't need to use `str(x)` (it is redundant at best and it will fail if `x` contains non-ASCII characters) – jfs Jan 27 '11 at 11:05
-
I liked this answer better because, with a slight modification, it is possible to keep track of the parent tags of the terminal nodes, which might be the reason why depth-first iteration is needed. The fact that BeautifulSoup does this already is not a problem @J.F.Sebastian – bfaskiplar Dec 12 '16 at 14:27