
Is there a way to show (a) the full path to a located node, and (b) the attributes of the nodes along that path, even if I don't know what those attributes might be called?

For example, given a page:

<!DOCTYPE html>
<HTML lang="en">
<HEAD>
  <META name="generator" content=
    "HTML Tidy for HTML5 for Linux version 5.2.0">
  <META charset="utf-8">
  <TITLE>blah ombid lipsum</TITLE>
</HEAD>
<BODY>
  <P>I'm the expected content</P>
  <DIV unexpectedattribute="very unexpected">
    <P>I'm wanted but not where you thought I'd be</P>
    <P class="strangeParagraphType">I'm also wanted text but also mislocated</P>
  </DIV>
</BODY>
</HTML>

I can find wanted text with

# Import Python libraries
import sys
from lxml import html

page = open( 'findme.html' ).read()
tree  = html.fromstring(page)

wantedText = tree.xpath(
  '//*[contains(text(),"wanted text")]' )

print( len( wantedText ), ' item(s) of wanted text found')

Having found it, however, I'd like to be able to print out that the wanted text is located at /HTML/BODY/DIV/P; or, even better, that it is located at /HTML/BODY/DIV/P[2]; and, better still, that it is at that location with the /DIV having unexpectedattribute="very unexpected" and the final <P> having the class strangeParagraphType.

CrimsonDark
  • Don't use `open('filename.html')` and then `tree = html.fromstring()`. Use `tree = html.parse('filename.html')`. The latter has some form of file encoding autodetection. The former does not. – Tomalak Jan 25 '19 at 05:44
  • Thank you for the observation re the use of open. With regard to the possible duplicate, in my view it is not a duplicate despite the title of the other page. The referenced page doesn't deal sensibly with HTML, it uses a well-formed XML string. Additionally, the provided answers don't actually produce the result that was wanted in the OP! – CrimsonDark Jan 25 '19 at 06:16
  • HTML or XML is irrelevant. As long as it can be parsed into a tree, `getpath` will get you the 2nd form of XPath you were willing to accept (`/HTML/BODY/DIV/P[2]`). – Tomalak Jan 25 '19 at 06:26
  • There is no way to get the 3rd form of XPath automatically, because that would require mind-reading. You can build it yourself by using a loop like `for parent_node in node.xpath('ancestor::*'):`, which builds a bespoke string from tag names and any attribute values you want to include. – Tomalak Jan 25 '19 at 06:31
  • Thank you for the comments. I realise it might be obvious to an expert how to modify the code in the purported duplicate to suit what I am looking for. However, the answer that is marked as correct on that page (i) uses XML whereas when I have used etree on the HTML it complains that the code is not well-formed, and (ii) that answer prints out a complete list of nodes whereas the OP actually asked about something similar to what I am looking for. Namely how to print out the path to a single, located node. And to repeat, it might be blindingly obvious to an expert ... but I'm a novice – CrimsonDark Jan 25 '19 at 06:35
  • Your mistake is that you are *copying the code* from the other answer and expect it to work. Of course the duplicate is not copy-and-paste ready, why would you think that? The core of the duplicate is to use `tree.getpath(some_node)`. And `tree.getpath()` exists on HTML trees as well, so use it on *your* code. – Tomalak Jan 25 '19 at 07:00
  • Actually no, that is not the problem. I might be a novice XPATH and LXML user but I'm not naive and far from being an inexperienced programmer. I *expect* to have to modify code in alternative examples. My problem is that it is not clear to me to what object to apply the getpath() method, and simply saying "look at the other page" does not help. With that in mind, the code as presented by @cullzie is enormously helpful. Something that is clear but less than ideal is more useful than something ideal but confusing. Hence my marking the answer from cullzie as correct. – CrimsonDark Jan 25 '19 at 08:18
  • `tree.getpath(wantedText[0])`, and so on. – Tomalak Jan 25 '19 at 08:19 (see the sketch after these comments)
  • The answer presented by @cullzie is wrong because it does not include positions (and not helpful because it re-implements existing functionality). – Tomalak Jan 25 '19 at 08:21
  • Thank you for persisting with me. And thank you both for the explanation, and the explanation of why the other code is not ideal. – CrimsonDark Jan 25 '19 at 08:25
  • Well, thanks for seeing it through. I really wanted you to discover the connection yourself, that's why I kept it vague. Now that you have something that works for your 2nd requirement, try if you can build a `for` loop that goes over the ancestors of each matched node (either `node.xpath('./ancestor::*')` or `node.iterancestors()`, both are equivalent) and pieces together an XPath string that fulfills your 3rd requirement. – Tomalak Jan 25 '19 at 08:53
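
Putting the suggestions from those comments into runnable form, here is a minimal sketch, assuming the same findme.html file as above. It parses the file with html.parse() (which handles encoding detection) and reports a positional path for each match via tree.getpath(); note that lxml's HTML parser lowercases tag names, so expect output along the lines of /html/body/div/p[2] rather than /HTML/BODY/DIV/P[2].

from lxml import html

# Parse the file directly rather than reading it into a string first;
# html.parse() performs its own encoding detection
tree = html.parse('findme.html')

wantedText = tree.xpath('//*[contains(text(),"wanted text")]')

for node in wantedText:
    # getpath() returns a positional XPath for this element within the tree
    print(tree.getpath(node))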

1 Answer


You could use something like this for the first form you asked about:

# Build each match's path root-first from its own tag and its ancestors' tags
['/'.join(list([wt.tag] + [ancestor.tag for ancestor in wt.iterancestors()])[::-1]).upper() for wt in wantedText]
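
For the sample document this expression should evaluate to something like ['HTML/BODY/DIV/P']; lxml's HTML parser lowercases tag names, so the .upper() call presumably exists to match the uppercase tags in the question.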

The third form can be created using the attrib property on the element objects and some custom logic:

>>> wantedText[0].getparent().attrib
{'unexpectedattribute': 'very unexpected'}
>>> wantedText[0].attrib
{'class': 'strangeParagraphType'}
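
As a rough sketch of that custom logic (not a definitive recipe), the loop below reuses the wantedText list from the question and a made-up helper named describe() to walk each match's ancestors root-first, tagging every step with whatever attributes it carries:

def describe(node):
    # Tag name plus any attributes the node happens to carry,
    # e.g. div[@unexpectedattribute="very unexpected"]
    attrs = ''.join('[@{}="{}"]'.format(name, value)
                    for name, value in node.attrib.items())
    return node.tag + attrs

for wt in wantedText:
    # iterancestors() yields nearest-first, so reverse to start at the root
    steps = [describe(ancestor) for ancestor in reversed(list(wt.iterancestors()))]
    steps.append(describe(wt))
    print('/' + '/'.join(steps))

For the sample document this should print something along the lines of /html[@lang="en"]/body/div[@unexpectedattribute="very unexpected"]/p[@class="strangeParagraphType"], showing both the unexpected attribute on the DIV and the class on the final P.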

Edit: The answer linked as a duplicate up top is definitely a better way to go.

cullzie
  • I'm sorry, but I don't understand the first part of the suggested code. Could you explain where it should go in my own code? If I insert it at the end of the existing code, it doesn't produce any output and I'm not clear about what it does produce. Thanks for the examples about printing the attributes. In that instance, I am clear about how to modify it and produce the custom logic. – CrimsonDark Jan 25 '19 at 06:22
  • OK, I've worked through your code again a few times ... and now I understand it. Thank you. – CrimsonDark Jan 25 '19 at 06:41
  • No problem. I think the answer linked as a duplicate is much better and is the way it should be done. – cullzie Jan 25 '19 at 06:45