0

I have to convert some beautifulsoup code. Basically what I want is just get all children of the body node and select which has text and store them. Here is the code with bs4 :

def get_children(self, tag, dorecursive=False):

    children = []
    if not tag :
        return children

    for t in tag.findChildren(recursive=dorecursive):
        if t.name in self.text_containers \
                and len(t.text) > self.min_text_length \
                and self.is_valid_tag(t):
            children.append(t)
    return children

this works fine when I try this with lxml lib instead, children is empty :

def get_children(self, tag, dorecursive=False):

    children = []
    if not tag :
        return children

    tags = tag.getchildren()
    for t in tags:
        #print(t.tag)
        if t.tag in self.text_containers \
                and len(t.tail) > self.min_text_length \
                and self.is_valid_tag(t):
            children.append(t)
    return children

any idea ?

Dany M
  • 760
  • 1
  • 13
  • 28

1 Answers1

1

Code:

import lxml.html
import requests


class TextTagManager:
    TEXT_CONTAINERS = {
        'li',
        'p',
        'span',
        *[f'h{i}' for i in range(1, 6)]
    }
    MIN_TEXT_LENGTH = 60

    def is_valid_tag(self, tag):
        # put some logic here
        return True

    def get_children(self, tag, recursive=False):
        children = []

        tags = tag.findall('.//*' if recursive else '*')
        for t in tags:
            if (t.tag in self.TEXT_CONTAINERS and
                    t.text and
                    len(t.text) > self.MIN_TEXT_LENGTH and
                    self.is_valid_tag(t)):
                children.append(t)
        return children


manager = TextTagManager()

url = 'https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers'
html = requests.get(url).text
doc = lxml.html.fromstring(html)

for child in manager.get_children(doc, recursive=True):
    print(child.tag, ' -> ', child.text)

Output:

li  ->  HTML traversal: offer an interface for programmers to easily access and modify of the "HTML string code". Canonical example: 
li  ->  HTML clean: to fix invalid HTML and to improve the layout and indent style of the resulting markup. Canonical example: 

.getchildren() returns all direct children. If you want to have a recursive option, you can use .findall():

tags = tag.findall('.//*' if recursive else '*')

This answer should help you understand the difference between .//tag and tag.

radzak
  • 2,986
  • 1
  • 18
  • 27
  • thanks for the response. i just tryed with this url: https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers and it doesn't work. in your example, please use tree.find('body') as a starter point (not your div) – Dany M May 31 '18 at 13:53
  • @DanyM yeah, I forgot that you also want to have the option to search recursively. I updated my answer (: – radzak May 31 '18 at 15:18
  • Actually I don't want recursive, it's why it is set to false by default. What I want to do is to extract content text in some containers 'p', 'div', 'article'... (not nav, header, footer...). I tried your example with the URL I provided (wikipedia) but no results. I get 5 node without the recursive mode or +365 with the recursive mode. I missed something I think – Dany M May 31 '18 at 17:28
  • @DanyM yeah probably, tbh I have no idea why you check the `len(t.tail)`? Are you sure you didn't mean to check `len(t.text)` instead? Let me know how that works. – radzak May 31 '18 at 17:36
  • thanks I get your code work if i test it separatly, but integrated in my code, it doesn't work as expected. I get all my "text" attributes empty, so it don't parse anything... have to investigate – Dany M Jun 01 '18 at 06:47