I've got this snippet:

<div>
    <p>This content is flagged as <code>TODO</code>. It has another element of <code>TODO</code> neither of the nested elements should be a child of div.</p>
</div>

In lxml I'm using this XPath expression:

//div/child::*

This is giving me a list of three items: p, code, and code. However, the two code elements shouldn't be considered children of div, only descendants of it. In fact, when I use //div/descendant::* I get the same three items. Ultimately I don't know how many children a div will have (it could be 1 or 100), but I only need the direct children, not any deeper descendants.

Does anyone have any idea how to do this with lxml? Or is it simply a bug I'll have to wait to get fixed?

LMC
Wayne Brissette
  • Isn't `//div/child::*` exactly equivalent to `//div/*`? In any case, with lxml 4.9.2 and your sample data, `//div/child::*` returns only a single result, `[]`. If you're seeing different behavior, could you update your question to include a complete minimal reproducible example? – larsks Jun 29 '23 at 18:41
  • Interestingly I'm not seeing this in the large dataset. I've found a workaround, but something isn't right somewhere (could very well be in how my looping is occurring in the larger dataset). Thanks for confirming this however! – Wayne Brissette Jun 30 '23 at 09:55
  • So, I finally figured this one out. I'm sure there's documentation that says don't do this, but I had an if statement, `if mycode.xpath('.//div/child::*'):`, and it was that statement that was failing. That is, if I assigned the result of that same XPath to a variable, lxml gave me exactly what I wanted and I could then check whether it contained anything; I just couldn't shortcut around it. Again, thanks everybody. I just wanted to follow up in case anybody else falls into this trap. – Wayne Brissette Jul 01 '23 at 11:31
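
The last comment doesn't show enough code to say exactly what went wrong, so the following is only a sketch of the pattern it describes: store the XPath result in a variable and test it explicitly. The sample markup and variable names are made up for illustration. It also shows a related, documented lxml behaviour that may or may not be what happened here: `len()` of an element counts its child elements, so a leaf element that exists still has length 0.

```python
from lxml import html

# Hypothetical sample document, loosely based on the snippet in the question.
doc = html.fromstring(
    "<html><body><div><p>Flagged as <code>TODO</code>.</p></div></body></html>"
)

# xpath() returns a plain Python list; keep it in a variable and test its length.
children = doc.xpath(".//div/child::*")
if len(children) > 0:
    print("div has", len(children), "direct child element(s)")

# find() returns a single element or None; test it with "is not None".
code = doc.find(".//code")
has_code = code is not None

# len(element) counts child elements, so this leaf <code> has length 0
# even though the element itself was found.
print(has_code, len(code))
```

Testing with `len(...)` and `is not None` keeps the check independent of how a particular object defines its truth value.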

1 Answer

lxml returns a single element, p, which in turn contains the two code elements:

>>> from lxml import html
>>> doc = html.parse('tmp.html')
>>> doc.xpath('//div/child::*')
[<Element p at 0x7efcc04d8e08>]
>>> doc.xpath('//div/child::*')[0].xpath('//code')
[<Element code at 0x7efcc04ea318>, <Element code at 0x7efcc04fbc28>]
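
One thing to watch in the session above: `//code` is an absolute path, so it searches the whole document rather than just the p it is called on. It gives the expected answer here only because the document has no other code elements. A minimal sketch of the difference, using a made-up document (not the tmp.html above) with a second code element outside the div:

```python
from lxml import html

# Hypothetical document: one <code> inside the div's <p>, one outside the div.
doc = html.fromstring(
    "<html><body>"
    "<div><p>Inside <code>A</code></p></div>"
    "<p>Outside <code>B</code></p>"
    "</body></html>"
)

# The only direct child of the div is its <p>.
p = doc.xpath("//div/child::*")[0]

absolute = p.xpath("//code")   # starts at the document root, ignores p
relative = p.xpath(".//code")  # starts at p, so only its own descendants

print([c.text for c in absolute])  # ['A', 'B']
print([c.text for c in relative])  # ['A']
```

Prefixing the path with `.` is the usual way to keep a query scoped to the element it is called on.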

For //div/descendant::* it returns 3 nodes, as it should:

>>> doc.xpath('//div/descendant::*')
[<Element p at 0x7efcc04d8e08>, <Element code at 0x7efcc04ea318>, <Element code at 0x7efcc04fbc28>]
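
For anyone who wants to reproduce this without a tmp.html file, here is a minimal self-contained sketch. It assumes the question's snippet wrapped in a bare html/body skeleton (the `SNIPPET` string is my stand-in, not the original file) and compares the two axes along with the abbreviated `//div/*` form mentioned in the comments.

```python
from lxml import html

# Stand-in for tmp.html, built from the snippet in the question.
SNIPPET = """
<html><body>
<div>
    <p>This content is flagged as <code>TODO</code>. It has another
    element of <code>TODO</code>.</p>
</div>
</body></html>
"""

doc = html.fromstring(SNIPPET)

children = doc.xpath("//div/child::*")           # direct children only
shorthand = doc.xpath("//div/*")                 # abbreviated form of the same step
descendants = doc.xpath("//div/descendant::*")   # children plus everything below them

print([el.tag for el in children])     # ['p']
print([el.tag for el in shorthand])    # ['p']
print([el.tag for el in descendants])  # ['p', 'code', 'code']
```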
LMC