
I'm working on a function that determines whether the content of a given HTML element, el, in an lxml ElementTree is the leading content of a line in the rendered HTML page. To do this, I'm trying to find the right-most block-level element that is left of el, and then determine whether there is any content between the two.

I figure this can be done with a traversal in reverse DFS order, starting at el. But I've also been trying to find out whether a simpler method exists using lxml or XPath. So far I've found ways to select elements that are ancestors or left siblings of a given element and match some criteria, but I haven't spotted anything that works on the entire tree to the right (or left) of a specific node.
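
To make that concrete, here is a minimal sketch of the kind of ancestor and left-sibling queries I mean. The tiny document and the shortened `BLOCK` test are made up for illustration; only `ancestor::` and `preceding-sibling::` are shown. Both come back empty for the element of interest, because the block-level h1 is a descendant of a sibling rather than an ancestor or a sibling itself.

```python
from lxml import etree

# Hypothetical miniature document: the h1 is a "cousin" of the target span.
doc = etree.fromstring(
    "<body><span><a><b><h1>heading</h1></b></a>"
    "<span id='tricky'>text</span></span></body>"
)
el = doc.xpath("//span[@id='tricky']")[0]

# Assumed block-level test, cut down to two tags to keep the sketch short.
BLOCK = "self::h1 or self::div"

# Block-level elements among el's ancestors.
ancestors = el.xpath(f"ancestor::*[{BLOCK}]")
# Block-level elements among el's left siblings.
left_sibs = el.xpath(f"preceding-sibling::*[{BLOCK}]")

print(ancestors, left_sibs)
```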

Does anybody know of a simpler way to do this search using lxml or xpath?

Example

<html>
<body class="first">
root
<!-- A span that does not have its own content, but does have several levels of children-->
<span>
  <a>
    <b>
      <h1 class="first">
        A block level that is the descendant of several non block levels
      </h1>
    </b>
  </a>
  <span class="first" id="tricky">
    A non-block level that has no block levels among its ancestors, but a block level element among its left cousins
  </span>
  <span>
    A non-block level that has no block levels among its ancestors, and content between itself and its nearest left-cousin block level
  </span>
</span>
<div class="first">
a block level
</div>
<div>
<span class="first">first content in a non block level in a block level</span>
<span>following content in a non block level in a block level</span>
</div>
<div>
  <i>  </i><b class="first">a non block level that contains the first content within a block level, but follows an empty non-block level</b>
</div>
</body>
</html>

In the above I've added a "first" class to any element that, when rendered, would appear to present the leading content of a line. Of particular interest is the element with id "tricky", because that element will present the first content of a line even though none of its ancestors nor its siblings are block level elements. "tricky" will be on a new line because a descendant of one of its siblings (the h1) is a block level, and there is no other content following that h1.
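
One way to double-check the "first" markings is a single forward pass over the tree instead of a search to the left of one element. The following is only a sketch under my own assumptions: `leading_elements` is a hypothetical helper, "content" means non-whitespace text, and the block-level set is a hand-picked list of tags, not anything lxml provides. It keeps a flag that is set whenever a block-level element opens or closes and cleared whenever non-whitespace text is seen; an element whose own text is seen while the flag is set is recorded as leading a line.

```python
from lxml import etree

# Assumed set of tags that start a new line when rendered.
BLOCK_LEVEL = {'blockquote', 'br', 'dd', 'div', 'dl', 'dt', 'h1', 'h2', 'h3',
               'h4', 'h5', 'h6', 'hr', 'li', 'ol', 'p', 'pre', 'td', 'ul'}

def leading_elements(root):
    """Return the elements whose own text is the first content of a line."""
    firsts = []
    line_start = True  # nothing has been emitted on the current line yet

    def visit(node):
        nonlocal line_start
        # Comments and processing instructions have a non-string tag.
        is_element = isinstance(node.tag, str)
        is_block = is_element and node.tag in BLOCK_LEVEL
        if is_block:
            line_start = True  # opening a block starts a new line
        if is_element:
            if node.text and node.text.strip():
                if line_start:
                    firsts.append(node)
                line_start = False
            for child in node:
                visit(child)
        if is_block:
            line_start = True  # closing a block ends the line
        # Tail text belongs to the parent, so it only clears the flag.
        if node.tail and node.tail.strip():
            line_start = False

    visit(root)
    return firsts

# Hypothetical miniature version of the example document.
doc = etree.fromstring(
    "<body>root<!-- note --><span><a><b><h1>heading</h1></b></a>"
    "<span id='tricky'>t</span><span id='after'>u</span></span>"
    "<div id='d1'>block</div><div><i> </i><b id='b1'>x</b></div></body>"
)
labels = [e.get("id") or e.tag for e in leading_elements(doc)]
print(labels)
```

This visits every node once, so it suits marking a whole document; for a single element, a search to the left of that element touches fewer nodes.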

Follow Up: At this point I have written a function in Python that does a kind of backwards traversal. It's a bit complicated, but it seems to work:

block_level = {'blockquote','br','dd','div','dl','dt','h1','h2','h3','h4','h5','h6','hr','li','ol','p','pre','td','ul'}

# Returns true if the content of the provided element is the leading content of a line
# This function runs on HTML elements before any translation occurs
# Here 'content' refers to non-whitespace characters
def is_first_in_line_html(self, el):
    # This element contains no content, so it can't be the leading content of a line.
    if el.text is None or el.text.strip() == '': return False

    # This element has content and is a block level, so its content is the leading content of a line.
    if el.tag in block_level: return True

    # This element has content, is not a block level, and is the body element. Definitely leading content of a line.
    if el.tag == 'body': return True

    # Final case - is there content between the present element and the nearest block level element to the left of the present
    # element.    

    def traverse_children(element, bound_text):
        children = element.iterchildren(reversed=True)
        for child in children:
            if child.tail is not None: bound_text = child.tail + bound_text
            if bound_text.strip() != '': return False
            if child.tag in block_level: return bound_text.strip() == ''
            rst_children = traverse_children(child, bound_text)
            if rst_children is not None: return rst_children
            if child.text is not None: bound_text = child.text + bound_text
            if bound_text.strip() != '': return False
        return None

    def traverse_left_sibs_and_ancestors(element, bound_text):
        left_sibs = element.itersiblings(preceding=True)
        for sib in left_sibs:
            if sib.tail is not None: bound_text = sib.tail + bound_text
            if bound_text.strip() != '': return False
            if sib.tag in block_level: return bound_text.strip() == ''
            rst_children = traverse_children(sib, bound_text)
            if rst_children is not None: return rst_children
            if sib.text is not None: bound_text = sib.text + bound_text
            if bound_text.strip() != '': return False
        parent = element.getparent()
        # Text to the left of the first child is parent.text (parent.tail comes after the parent closes).
        if parent.text is not None: bound_text = parent.text + bound_text
        if parent.tag == 'body': return bound_text.strip() == ''
        if parent.tag in block_level: return bound_text.strip() == ''
        return traverse_left_sibs_and_ancestors(parent, bound_text)

    return traverse_left_sibs_and_ancestors(el, '')
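
The traversal above relies on how lxml splits text around elements, so here is a small sketch of that behaviour. The one-line document is made up for illustration: `.text` is the text between an element's opening tag and its first child, and `.tail` is the text that follows the element's closing tag.

```python
from lxml import etree

# Hypothetical fragment: text before, inside, and after a child element.
div = etree.fromstring("<div>before<span>inside</span>after</div>")
span = div[0]

print(repr(div.text))   # text between <div> and its first child
print(repr(span.text))  # text inside <span>
print(repr(span.tail))  # text after </span>, still inside <div>
print(repr(div.tail))   # nothing follows </div> in this fragment
```

Walking leftwards from an element therefore meets the tails of its left siblings first and the parent's `.text` last.
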
Bill Bushey
  • I wish you had posted a simplified example of your document and illustrated where you are and the area (or node) that you want to find. You can search the nodes "before" you (in document order) using the `preceding` axis and the nodes "after" you using the `following` axis. But it sounds like you were there already and it didn't help, so maybe throw in an example? – Pavel Veller May 14 '12 at 18:35
  • Oy, sorry, I completely forgot about this question. Very bad etiquette on my part for both leaving this question and not providing an example as Pavel and Dimitre have said. Editing now to add an example. – Bill Bushey Jun 16 '12 at 18:00
