16

Is there a way to do a depth-first traversal (DFT) of a BeautifulSoup parse tree? I want to start at the root (usually <html>), get all of its child elements, then get each child's children, and so on until I hit a terminal node, at which point I build my way back up the tree. The problem is that I can't find a method that lets me do this. I found findChildren, but it seems to return a list in which the first entry holds the entire page and each subsequent entry holds a smaller piece of it. I might be able to use this for a traversal, but apart from the last entry in the list there appears to be no way to tell whether an entry is a terminal node. Any ideas?

Ian Burris
  • I am not aware of a solution with BeautifulSoup, but I have a solution with the lxml library that works. Is using BeautifulSoup a compulsion? If not, I can suggest a solution with lxml. – user225312 Jan 27 '11 at 09:10

2 Answers

19

mytag.find_all() already does that:

If you call mytag.find_all(), Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on

from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup("""<!doctype html>
<div id=a>A
  <div id=1>A1</div>
  <div id=2>A2</div>
</div>
<div id=b>B
  <div id=I>BI</div>
  <div id=II>BII</div>
</div>
""", "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning

for div in soup.find_all("div", recursive=True):
    print(div.get('id'))

Output

a
1
2
b
I
II

The output is in document order, which confirms that it is a depth-first traversal.
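
Note that find_all() returns only Tag objects, so text nodes never appear in its result. If the goal is to tell containers from terminal elements during the walk, one option is to check whether a tag has any descendant tag. A minimal sketch, assuming bs4 with the built-in html.parser; the quoted attribute values and the `order` list are illustrative additions, not part of the example above:

```python
from bs4 import BeautifulSoup

html = """<div id="a">A
  <div id="1">A1</div>
  <div id="2">A2</div>
</div>
<div id="b">B
  <div id="I">BI</div>
  <div id="II">BII</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

order = []  # illustrative: (id, is_leaf) pairs in visiting order
for tag in soup.find_all(True):  # True matches every tag
    # A tag with no descendant tag is treated as a terminal element here.
    is_leaf = tag.find(True) is None
    order.append((tag.get("id"), is_leaf))
    print(tag.get("id"), "leaf" if is_leaf else "container")
```

Here `tag.find(True)` returns the first descendant tag or None, so None marks an element that contains no other elements (it may still contain text).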


Old Beautiful Soup 3 answer:

recursiveChildGenerator() already does that:

import BeautifulSoup  # Beautiful Soup 3

soup = BeautifulSoup.BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
    name = getattr(child, "name", None)
    if name is not None:
        print(name)  # a tag: print its name
    elif not child.isspace():  # leaf node, don't print spaces
        print(child)

Output

For the html from @msalvadores's answer:

html
ul
li
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
li
Aliquam tincidunt mauris eu risus.
li
Vestibulum auctor dapibus neque.
html

NOTE: html is printed twice because the example, as originally posted, contained two opening <html> tags.
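
For Beautiful Soup 4, the closest counterpart to recursiveChildGenerator() is the `.descendants` property, which yields both tags and strings. The following is a minimal sketch, assuming bs4 and the built-in html.parser; the shortened HTML and the `seen` list are illustrative only, and collecting the nodes in a list makes it easy to check the visiting order on your own input:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<html>
<ul>
   <li>Lorem ipsum</li>
   <li>Aliquam tincidunt</li>
</ul>
</html>"""
soup = BeautifulSoup(html, "html.parser")

seen = []  # illustrative: nodes in the order .descendants yields them
for node in soup.descendants:  # a property, so no parentheses
    if isinstance(node, Tag):
        seen.append(node.name)     # tag: record its name
    elif node.strip():
        seen.append(node.strip())  # non-blank string: a leaf node
print(seen)
```

With this input each li is followed directly by its own text, before the next li is reached.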

jfs
  • @Sebastian I fixed the html example in case you want to edit your answer. Good to know that recursiveChildGenerator exists (+1 to that). – Manuel Salvadores Jan 27 '11 at 11:29
  • 2
    Thanks - this worked for me! For those who might be looking for documentation on this, in BeautifulSoup 4 `recursiveChildGenerator()` was renamed `descendants()`. – J. Taylor Jan 25 '19 at 07:59
  • @J.Taylor And the documentation on `descendants` (a property, so no parentheses) states that it performs a breadth-first traversal. I'm not sure if this answer is still valid for current bs4 versions? – Charles Langlois Oct 20 '20 at 20:21
  • 2
    @CharlesLanglois: to get DFT, `find_all(recursive=True)` can be used in bs4. I've updated the answer. – jfs Oct 21 '20 at 16:32
  • Simply use `soup.find_all(recursive=True)` instead of `soup.find_all("div", recursive=True)` to get all descendant tags in the general case. – Bálint Sass Jan 12 '23 at 13:50
6

I think you can call the "childGenerator" method recursively to walk the tree in a depth-first fashion.

def recursiveChildren(x):
    # Containers (tags) expose childGenerator(); strings do not.
    if "childGenerator" in dir(x):
        for child in x.childGenerator():
            name = getattr(child, "name", None)
            if name is not None:
                print("[Container Node] %s" % child.name)
            recursiveChildren(child)
    else:
        if not x.isspace():  # Just to avoid printing "\n" parsed from document.
            print("[Terminal Node] %s" % x)

if __name__ == "__main__":
    soup = BeautifulSoup(your_data)
    for child in soup.childGenerator():
        recursiveChildren(child)

The check "childGenerator" in dir(x) makes sure that an element is a container. Terminal nodes such as NavigableStrings are not containers and have no children.

For some example HTML like:

<html>
<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
</ul>
</html>

This script prints:

[Container Node] ul
[Container Node] li
[Terminal Node] Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
[Container Node] li
[Terminal Node] Aliquam tincidunt mauris eu risus.
[Container Node] li
[Terminal Node] Vestibulum auctor dapibus neque.
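
The same recursion can be sketched for Beautiful Soup 4, using isinstance() instead of dir() and carrying the chain of parent tag names down to each terminal node. This is only an illustration under stated assumptions: bs4 with the built-in html.parser, a hypothetical helper name `terminal_nodes`, and shortened example HTML.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def terminal_nodes(node, path=()):
    """Yield (path, text) for every non-blank string, depth first.

    `terminal_nodes` is a hypothetical helper, not part of bs4.
    """
    for child in node.children:
        if isinstance(child, Tag):
            # Container: descend, extending the path of parent tag names.
            yield from terminal_nodes(child, path + (child.name,))
        elif isinstance(child, NavigableString) and child.strip():
            # Terminal node: report it together with its ancestors.
            yield path, child.strip()

html = "<html><ul><li>Lorem</li><li>Aliquam</li></ul></html>"
soup = BeautifulSoup(html, "html.parser")
leaves = list(terminal_nodes(soup))
for path, text in leaves:
    print("/".join(path), "->", text)
```

Because each recursive call returns to its caller once a subtree is exhausted, any per-parent work (for example, aggregating the results of the children) can be done right after the inner loop.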
Manuel Salvadores
  • 1
    There are a number of problems in your code: 1. BeautifulSoup already does DFT http://stackoverflow.com/questions/4814317/depth-first-traversal-on-beautifulsoup-parse-tree/4815399#4815399 2. don't use `dir()` or `.__dict__`; use `hasattr()` instead. 3. `x` already has the required string methods; you don't need to use `str(x)` (it is redundant at best and it will fail if `x` contains non-ascii characters) – jfs Jan 27 '11 at 11:05
  • I liked this answer better because, with a slight modification, it is possible to keep track of the parent tags of the terminal nodes, which might be the reason why depth-first iteration is needed. The fact that BeautifulSoup does this already is not a problem @J.F.Sebastian – bfaskiplar Dec 12 '16 at 14:27