
I work with FreeMind, a program that lets you create trees and export them as HTML files. I need to extract every "path" (root-to-leaf branch) of such a tree and put the paths into a list, so that I can work with each one separately afterwards.

For example, from this HTML:

<body>
   <p>Example</p>
   <ul>
      <li>
         The First List
         <ul>
            <li>1</li>
            <li>2</li>
            <li>3</li>
         </ul>
      </li>
      <li>
         The Second List
         <ul>
            <li>4.1</li>
            <li>4.2</li>
         </ul>
      </li>
   </ul>
</body>

I need to get the following separate branches:

<body>
   <p>Example</p>
   <ul>
      <li>
         The First List
         <ul>
            <li>1</li>
         </ul>
      </li>
   </ul>
</body>

<body>
   <p>Example</p>
   <ul>
      <li>
         The First List
         <ul>
            <li>2</li>
         </ul>
      </li>
   </ul>
</body>

<body>
   <p>Example</p>
   <ul>
      <li>
         The First List
         <ul>
            <li>3</li>
         </ul>
      </li>
   </ul>
</body>

<body>
   <p>Example</p>
   <ul>
      <li>
         The Second List
         <ul>
            <li>4.1</li>
         </ul>
      </li>
   </ul>
</body>

<body>
   <p>Example</p>
   <ul>
      <li>
         The Second List
         <ul>
            <li>4.2</li>
         </ul>
      </li>
   </ul>
</body>

I tried the following code, but it fails with the error "maximum recursion depth exceeded while calling a Python object":

from bs4 import BeautifulSoup

with open("example.html") as html_file:
    parsed = BeautifulSoup(html_file, "html.parser")

body = parsed.body

def all_nodes(obj):
    for node in obj:
        print(node)
        all_nodes(node)

print(all_nodes(body))

I should explain what I plan to do with all this. I write test cases in FreeMind, and I am trying to write a tool that could, for example, generate a CSV table of all the test cases. For now, though, I am just trying to get all the test cases as text.

Kirill
  • Do you need a generic solution for any depth and any elements, or do you know the depth and the document structure beforehand? Thanks. – alecxe Apr 07 '14 at 02:09
  • I don't know the exact depth. I tried to work out the depth in order to set the recursion limit, but I got the same error. – Kirill Apr 07 '14 at 02:10

1 Answer


Here's one way to do it. It isn't particularly simple or Pythonic, and personally I don't like the solution, but it should be a good start for you. I bet there is a shorter, more elegant way to do the same thing.

The idea is to iterate over all elements that have no child elements. For each such element, walk up through its parents until we hit body:

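from bs4 import BeautifulSoup, Tag


data = """
your HTML goes here
"""
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")
for element in soup.body.find_all():
    # A leaf is a tag that contains no other tags.
    children = element.find_all()
    if not children:
        tag = Tag(name=element.name)
        tag.string = element.string
        # Wrap the leaf in an empty copy of each ancestor, up to <body>.
        for parent in element.parents:
            parent = Tag(name=parent.name)
            parent.append(tag)
            tag = parent
            if tag.name == 'body':
                break
        print(tag)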

It produces:

<body><p>Example</p></body>
<body><ul><li><ul><li>1</li></ul></li></ul></body>
<body><ul><li><ul><li>2</li></ul></li></ul></body>
<body><ul><li><ul><li>3</li></ul></li></ul></body>
<body><ul><li><ul><li>4.1</li></ul></li></ul></body>
<body><ul><li><ul><li>4.2</li></ul></li></ul></body>
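On the "maximum recursion depth exceeded" error from the question: the tree itself is not too deep. Iterating over a tag also yields its text nodes, and a text node is a Python string. Iterating over a string yields one-character strings, and iterating over a one-character string yields that same character again, so `all_nodes` never bottoms out no matter how high the recursion limit is set. A minimal sketch of a walker that only descends into tags (the helper name `iter_tags` and the shortened sample HTML are made up for illustration):

```python
from bs4 import BeautifulSoup, Tag


def iter_tags(node):
    """Yield every Tag below node, depth first, skipping text nodes."""
    for child in node.children:
        # Only descend into real tags; strings would recurse forever.
        if isinstance(child, Tag):
            yield child
            yield from iter_tags(child)


# Shortened, hypothetical sample in the shape of the question's document.
html = "<body><p>Example</p><ul><li>A<ul><li>1</li><li>2</li></ul></li></ul></body>"
soup = BeautifulSoup(html, "html.parser")
names = [tag.name for tag in iter_tags(soup.body)]
print(names)
```

In practice `find_all()` with no arguments already gives the same set of tags, which is why the loop above does not need explicit recursion.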

UPD (writing parent's text too):

soup = BeautifulSoup(data, "html.parser")
for element in soup.body.find_all():
    children = element.find_all()
    if not children:
        tag = Tag(name=element.name)
        tag.string = element.string
        for parent in element.parents:
            parent_tag = Tag(name=parent.name)
            # Note: .string is None whenever a tag has more than one child,
            # so a parent that holds both text and a nested list is skipped here.
            if parent.string:
                parent_tag.string = parent.string
            parent_tag.append(tag)
            tag = parent_tag
            if tag.name == 'body':
                break
        print(tag)
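If the parents' text still does not show up, one likely reason is that `.string` returns None for any tag with more than one child, which is the case for a list item that holds both a label and a nested list. A possible workaround, sketched here under that assumption, is to collect only the text nodes that are direct children of each parent. The helpers `own_text` and `leaf_branches` and the shortened sample HTML are hypothetical names for illustration, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup, NavigableString


def own_text(tag):
    """Return the text that sits directly inside tag, ignoring nested tags."""
    parts = (child.strip() for child in tag.children
             if isinstance(child, NavigableString))
    return " ".join(part for part in parts if part)


def leaf_branches(html):
    """Build one root-to-leaf copy of <body> per childless tag."""
    soup = BeautifulSoup(html, "html.parser")
    branches = []
    for leaf in soup.body.find_all():
        if leaf.find(True) is not None:  # has child tags, so not a leaf
            continue
        branch = soup.new_tag(leaf.name)
        branch.string = own_text(leaf)
        for parent in leaf.parents:
            wrapper = soup.new_tag(parent.name)
            text = own_text(parent)
            if text:
                wrapper.append(text)  # keep the parent's own label
            wrapper.append(branch)
            branch = wrapper
            if parent.name == "body":
                break
        branches.append(branch)
    return branches


# Shortened, hypothetical sample in the shape of the question's document.
html = ("<body><p>Example</p><ul><li>The First List"
        "<ul><li>1</li><li>2</li></ul></li></ul></body>")
branches = leaf_branches(html)
for branch in branches:
    print(branch)
```

This sketch still treats the `<p>Example</p>` element as a branch of its own rather than repeating it in every branch, and it drops tag attributes, so it is a starting point rather than a complete solution.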

Hope that helps.

alecxe
  • I'm delving into documentation right now, but do you think `SoupStrainer` can isolate children-less tags ahead of time? So far, this does not seem the case... – WGS Apr 07 '14 at 03:00
  • @Nanashi yeah, this is what I didn't like about the solution - overhead of going through all of the elements recursively. `SoupStrainer` wouldn't help I think, cause we'll need all of the parents at some point..basically we need all nodes out there. – alecxe Apr 07 '14 at 03:02
  • Thank you. Works perfectly but what if I want to get every parent's content as well? – Kirill Apr 07 '14 at 07:00
  • @KirillZhukov see the `UPD` section, it should work - give it a try. – alecxe Apr 07 '14 at 13:25
  • Thanks, but I guess parent.string is always None, so I am getting the same result. I tried element.string instead of parent.string, but that produces odd output with the leaf text '1' repeated at every level of the branch. See my updates please. – Kirill Apr 07 '14 at 14:27
  • @KirillZhukov oops, I see that `The First List` and `The Second List`, let me check. – alecxe Apr 07 '14 at 14:32
  • @alecxe: Sorry to post here, but is it possible to get off-SO help from you? I'm studying something right now and I'd appreciate if you have some time to help. Thanks! – WGS Apr 10 '14 at 16:28
  • @Nanashi sure, there is a linkedin link in my profile - just throw me a message. – alecxe Apr 10 '14 at 16:29