I want to extract Text 1, Text 2 and Text 3 from the HTML below in one go using XPath. Is that possible?

When I run

//div/strong/a/text()/../../../div/span/span/span/span/text() 

I only get Text 2 (I haven't included the path for Text 3 just yet).

<div>
    <strong>
    <a>
    Text 1                            
    </a>
    </strong>
    <div>
        <span>
        <span>
        <span>
        <span >
        Text 2
        </span>
        </span>
        </span>
        </span>
        <span>
        <span>
        <span>
        <span>
        Text 3
        </span>
        </span>
        </span>
        </span>
    </div>
</div>

I have read several other questions, like these

XPath: How to collect multiple texts fragments from an XHTML node?

XPath expression: selecting text nodes between element nodes

but none of them applies to my situation.

Mark Rotteveel
d-b
  • `//text()` will give you all text nodes in a document. Alternatively, *xpath1* `|` *xpath2* `|` *xpath3* will give you the union of the listed XPath expressions. If you wish to be more discriminating, you'll have to clarify your requirements. – kjhughes Feb 15 '22 at 19:43
  • I am not sure I follow you here. The thing is that sometimes Text 2 or Text 3 is missing and then I am not interested in Text 1. – d-b Feb 15 '22 at 19:49

1 Answer

There is no single XPath expression that will match all three texts here separately.
As mentioned by kjhughes, you can use `//text()` to get all three texts together, since they are all inside the root div element.
You can get Text 1 separately with the XPath `//div/strong/a/text()`, and Text 2 and Text 3 with `//div//span/text()`.
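
To try the two separate lookups outside the browser, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. It rests on a few assumptions: the markup is well-formed enough to parse as XML, the sample is a compacted copy of the HTML from the question, and since ElementTree only implements a subset of XPath (no `text()` node test, no `|` union), the sketch selects the elements and reads their `.text` instead. It also spells out the full `span/span/span/span` path rather than `//span`, so only the innermost spans are returned.

```python
import xml.etree.ElementTree as ET

# Compacted copy of the markup from the question (assumed well-formed).
HTML = """
<div>
    <strong><a> Text 1 </a></strong>
    <div>
        <span><span><span><span> Text 2 </span></span></span></span>
        <span><span><span><span> Text 3 </span></span></span></span>
    </div>
</div>
"""

root = ET.fromstring(HTML)  # root is the outer <div>

# Paths are relative to the outer <div>, so "strong/a" plays the role
# of //div/strong/a in the answer.
text1 = [a.text.strip() for a in root.findall("strong/a")]

# Innermost spans only; plays the role of the span lookup in the answer.
others = [s.text.strip() for s in root.findall("div/span/span/span/span")]

print(text1, others)
```

The `.strip()` calls remove the whitespace that surrounds each text in the original markup.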

Prophet
  • Hmm, the reason I step upwards to Text 2 (and Text 3) is that I don't care about Text 1 unless Text 2 and Text 3 also exist. Instead of what I try in my question, should I just keep the first `text()` and step up to T2 and T3 and just check their existence instead? – d-b Feb 16 '22 at 10:09
  • It depends on your specific needs and flow. You can validate the existence and values of Text 2 and/or Text 3 first, and only then get the value of Text 1. It is 100% possible here. You can use the XPaths I mentioned here for that. – Prophet Feb 16 '22 at 10:14
  • My flow is completely manual. I execute this in the browser console once or twice for a specific site and then a week or month later I do it on another site with a similar structure (several "main cells" with 2-3 "subcells" with a text string in them that I want to extract). I then insert the text into a spreadsheet for further manipulation. – d-b Feb 17 '22 at 11:53
  • So, what is the question now? – Prophet Feb 17 '22 at 11:55
  • What is unclear? I want a "one-liner" that extracts Text 1, 2 and 3 if and only if Text 2 and 3 exist, that I can execute in my browser console. – d-b Feb 19 '22 at 11:29
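
The "if and only if" requirement from the comments can be sketched as follows. This is a rough Python illustration, not a browser one-liner: `extract_rows` and `SAMPLE` are hypothetical names, the `<body>` wrapper and the second, incomplete cell are invented for the example, and the markup is assumed to parse as XML. In a full XPath 1.0 engine the same condition could presumably be written as a predicate such as `//div[count(div/span/span/span/span) >= 2]`, but ElementTree does not support `count()`, so the check is done in Python.

```python
import xml.etree.ElementTree as ET

def extract_rows(markup: str) -> list[list[str]]:
    """Return [Text 1, Text 2, Text 3] per cell, skipping incomplete cells."""
    root = ET.fromstring(markup)
    rows = []
    # Treat every <div> with a strong/a child as a "main cell".
    for cell in root.iter("div"):
        title = cell.find("strong/a")
        if title is None:
            continue
        subtexts = [
            (s.text or "").strip()
            for s in cell.findall("div/span/span/span/span")
        ]
        subtexts = [t for t in subtexts if t]
        # Keep the cell only when both Text 2 and Text 3 are present.
        if len(subtexts) >= 2:
            rows.append([(title.text or "").strip(), *subtexts])
    return rows

# Hypothetical page: one complete cell, one cell missing its second subcell.
SAMPLE = """
<body>
  <div>
    <strong><a>Text 1</a></strong>
    <div>
      <span><span><span><span>Text 2</span></span></span></span>
      <span><span><span><span>Text 3</span></span></span></span>
    </div>
  </div>
  <div>
    <strong><a>Other 1</a></strong>
    <div>
      <span><span><span><span>Other 2</span></span></span></span>
    </div>
  </div>
</body>
"""

for row in extract_rows(SAMPLE):
    print("\t".join(row))  # tab-separated, ready to paste into a spreadsheet
```

Only the first cell is printed; the second is dropped because it has a single subcell.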