
I have this example HTML:

<div class="_bns--table">
    
    <table class="bns--table" border="0" cellspacing="0" cellpadding="0" width="737">
<tbody><tr><td width="151" colspan="2" rowspan="2"><p><b>Field Name</b></p>
</td>
<td width="161"><p><b>GG Text (TXT)</b></p>
</td>
<td width="142"><p><b>Excellent Text (TXT)</b></p>
</td>
<td width="142"><p><b>Text (Text)</b></p>
</td>
<td width="142"><p><b>Super Text (TXT)</b></p>
</td>
</tr><tr><td width="161"><p>Super Instruction</p>
</td>
<td width="142"><p>Super Instruction</p>
</td>
<td width="142"><p>Super Instruction</p>
</td>
<td width="142"><p>Super Instruction</p>
</td>
</tr><tr><td width="76"><p>SUBMIT TO: </p>
</td>
<td width="76"><p><b>Intermediary Text</b></p>
</td>
<td width="161"><p><b>Q.W. Super Good Text</b></p>
<p><b>Address:</b> Long Dong Plaza New York, United States</p>
<p><b>Sample:</b> 001068967</p>
<p><b>TEXT CODE:</b> TEXTT33</p>
<p><b>SUPER EXAMPLE:</b> 031111521</p>
</td>
<td width="142"><p><b>The Company of Super Compania</b> International Company Division<br>
 <b>Address:</b> 44 Wong Street West<br>
 Toronto, Ontario, Canada<br>
 <b>TEXT CODE: </b>BOBFFCDD</p>
</td>
<td width="142"><p><b>DGG Company, Belgium Company</b></p>
<p><b>Address:</b> Brussels, Belgium</p>
<p><b>Sample Number:</b> 201-0207080-43</p>
<p><b>TEXT CODE:</b> DDRUDEDD040</p>
</td>
<td width="142"><p><b>TDTT Company PLC</b><br>
 <b>Address:</b> 8 Red Square Chicken Head, London, England, E15 8HQ <br>
 <b>TEXT CODE:</b> BIBHGB77<br>
 <b>Sample:</b> 47605627</p>
</td>
</tr><tr><td width="76"><p>LETTER TO: </p>
</td>
<td width="76"><p><b>Excellent Company</b></p>
</td>
<td width="586" colspan="4"><p><b>Superexamplecompany (Lols &amp; Keks Ltd)</b></p>
<p><b>Address:</b> Somethingsuperimportant, Brothers and Sisters</p>
<p><b>TEXT Code:</b> BONTFQWE</p>
</td>
</tr><tr><td width="76"><p>:</p>
</td>
<td width="76"><p><b>Postal/ Courier's Information</b></p>
</td>
<td width="586" colspan="4"><p><b>Your Full Name or Your Company Full Name</b><br>
 <b>Address:</b> including Street Number, Street Name, City, Province/State, Country, and Postal Code <br>
 Your Sample <b>Code</b> and <b>Sample Number</b></p>
</td>
</tr></tbody></table>

</div>

What I need to do is match the element that meets the following criteria: it contains the word "example" and the word "sample" (both case-insensitive and as whole words only), as well as a number at least 3 digits long. In the HTML above, only the following element matches those criteria:

<td width="161"><p><b>Q.W. Super Good Text</b></p>
<p><b>Address:</b> Long Dong Plaza New York, United States</p>
<p><b>Sample:</b> 001068967</p>
<p><b>TEXT CODE:</b> TEXTT33</p>
<p><b>SUPER EXAMPLE:</b> 031111521</p>
</td>

I have this huge XPath 1.0 expression:

//*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(., 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(., 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and
  translate(., translate(., '0123456789', ''), '') >= 3
]
[not(
*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(., 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(., 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and
  translate(., translate(., '0123456789', ''), '') >= 3
]
)]

While it's supposed to select only the element that doesn't have any children matching the same criteria (quoted above), for some reason it selects the whole parent element <tr>. I need a query that matches only that single td element of this table, but without restricting the query to a specific element type.

It is a requirement to use XPath 1.0, because the software I'm using (Octoparse) doesn't support newer XPath versions.

  • "number at least 3 digits long" doesn't seem to be implemented by `translate(., translate(., '0123456789', ''), '') >= 3`, it sounds more as if you want e.g. `string-length(translate(., translate(., '0123456789', ''), '')) >= 3` (although that would apply to e.g. `1x2x3`). Is that just your particular software that fails the `[not( *[..])]` test? I would think that test should prevent the `tr` parent from being selected if the `td` meets the conditions and I have tested the .NET XPath 1.0 implementation and it doesn't select the `tr`. – Martin Honnen Apr 07 '23 at 17:49
  • @MartinHonnen, the example code I provided is enclosed in several other parents, so the element is far from being the root element on the page. It's just too much to copy here entirely. Maybe you could reproduce the issue given these circumstances? Previous tests concluded that the software didn't fail the "not" test. – Price Mitchell Apr 08 '23 at 09:22

2 Answers


Current XPath versions (e.g. XPath 3.1) are more expressive and offer, for example, the innermost function, which keeps only those selected nodes that are not ancestors of another selected node:

innermost(//*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(., 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(., 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and
  string-length(translate(., translate(., '0123456789', ''), '')) >= 3
])
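
Since Octoparse is limited to XPath 1.0, the following is only a rough Python sketch of what innermost does, using the standard library's ElementTree. The names `matches` and `innermost` and the tiny sample document are made up for illustration, and the predicate is a simplified substring test rather than the whole-word test from the question:

```python
import xml.etree.ElementTree as ET

# A cut-down, well-formed stand-in for the table in the question.
doc = ET.fromstring(
    "<table><tr>"
    "<td><p>Sample: 001068967</p><p>SUPER EXAMPLE: 031111521</p></td>"
    "<td><p>Sample: 47605627</p></td>"
    "</tr></table>"
)

def matches(el):
    # String value of the element (all descendant text), lower-cased.
    text = "".join(el.itertext()).lower()
    return "sample" in text and "example" in text

def innermost(root, predicate):
    hits = [el for el in root.iter() if predicate(el)]
    # Keep a hit only if none of its proper descendants is also a hit.
    return [
        el for el in hits
        if not any(d is not el and predicate(d) for d in el.iter())
    ]

print([el.tag for el in innermost(doc, matches)])  # ['td']
```

The table and tr also satisfy the predicate, but each has a matching descendant, so only the td survives.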
Martin Honnen

The key to finding a sequence of 3 digits in a string, using XPath 1.0, is to first convert the string you're searching within to replace all occurrences of each digit character with the same character (e.g. 0). Then you can search the resulting string for that character repeated 3 times (e.g. 000).

//*[
   contains(
      translate(
         .,
         '123456789', 
         '000000000'
      ),
      '000'
   )
]

Otherwise, the only way to detect a 3 digit number in a string in XPath 1.0 would be to test explicitly for the existence of every 3 digit number, e.g. as in this example where I've cut out most of the values to keep my answer short:

//*[
   contains(., '000') or
   contains(., '001') or
   contains(., '002') or
   contains(., '003') or
   contains(., '004') or
   contains(., '005') or
   contains(., '006') or

   contains(., '996') or
   contains(., '997') or
   contains(., '998') or
   contains(., '999')
]

NB you could certainly also search for elements which contain at least 3 digits, but as Martin points out in his comment, that would match string values like '1x4x6' which is not a 3 digit number:

//*
[
   string-length(.) - string-length(translate(., '0123456789', '')) >= 3
]
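
To make the difference between the two numeric tests concrete, here is a small Python sketch that emulates XPath 1.0's translate() with a hypothetical helper, `xpath_translate`, and applies both tests to a few strings. It is an illustration of the logic only, not something Octoparse runs:

```python
def xpath_translate(text, src, dst):
    """Emulate XPath 1.0 translate(): map src[i] to dst[i]; drop chars with no counterpart."""
    table = {}
    for i, ch in enumerate(src):
        if ch not in table:  # the first occurrence in src wins, as in XPath
            table[ch] = dst[i] if i < len(dst) else ""
    return "".join(table.get(ch, ch) for ch in text)

def has_digit_run(text):
    # contains(translate(., '123456789', '000000000'), '000')
    return "000" in xpath_translate(text, "123456789", "000000000")

def has_three_digits(text):
    # string-length(.) - string-length(translate(., '0123456789', '')) >= 3
    return len(text) - len(xpath_translate(text, "0123456789", "")) >= 3

for s in ["Sample: 001068967", "1x4x6", "TEXTT33"]:
    print(repr(s), has_digit_run(s), has_three_digits(s))
```

Only the first string passes the run test; '1x4x6' passes the counting test even though it contains no 3-digit number.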

So I recommend this version of your XPath expression with the test for numbers corrected and updated:

//*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and contains(
      translate(
         .,
         '123456789', 
         '000000000'
      ),
      '000'
   )
]
[not(
*[
  (
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'example', 'EXAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' EXAMPLE '
    ) and
    contains(
      concat(
        ' ',
        translate(
          translate(normalize-space(.), 'sample', 'SAMPLE'), 
          ':,;.',
          '    '
        ),
        ' '
      ),
      ' SAMPLE '
    )
  ) and contains(
      translate(
         normalize-space(.),
         '123456789', 
         '000000000'
      ),
      '000'
   )
]
)]

And here's an example with your newly updated HTML running as an XPath fiddle. It returns a td element, as you can see.

Conal Tuohy
  • Thank you for your input, but the main issue here is that the query selects the tr element instead of the td element. I replaced the check for a 3-digit long number with a regular check for any number (`. != translate(., '0123456789', '')`), but it keeps selecting the `tr` element. – Price Mitchell Apr 08 '23 at 08:54
  • I must add that the example code I provided is enclosed in several other parents, so the element is far from being the root element on the page. It's just too much to copy here entirely. – Price Mitchell Apr 08 '23 at 09:14
  • I replaced the invalid part of your query (above) with a simple check for any digits and it returned me single `td` element in response which contained the text `EXAMPLE` and a whole bunch of digits. It did not return me a `tr` element. – Conal Tuohy Apr 08 '23 at 12:32
  • I've edited my answer to include a full example XPath with the numeric test fixed, and a link to an XPath fiddle website, showing that it works correctly with your sample data. – Conal Tuohy Apr 09 '23 at 03:47
  • I'm very sorry for the very late response, but now I've provided a more complete piece of the HTML page I'm working with, and even on Martin Honnen's XPath fiddle website it still matches the `tr` element instead of the innermost td element. Please advise. – Price Mitchell Apr 24 '23 at 00:03
  • I tested the XPath with your updated HTML (I had to replace `<br>` with `<br/>` in order to parse it as XML); the result was that it matched a `td` element, not a `tr`. I will update my answer to include a link to Martin's XPath Fiddle website with your new HTML, so you can see for yourself. – Conal Tuohy Apr 24 '23 at 02:01
  • I've updated my sample HTML once again. I'm sorry for the confusion, this time it will definitely select `tr` instead of the `td`, I promise. You can even save that sample HTML in an HTML file, open it in Chrome and try to search the code using your expression; it will be the same. – Price Mitchell Apr 24 '23 at 23:31
  • Yes with your new data the XPath will indeed identify the `tr` element as the container for the words "example" and "sample", and this is correct, as I understand your requirement. The word "example" appears in one of the `td` elements, and the word "sample" appears in a different `td`, so the element which contains both words is the parent `tr`. – Conal Tuohy Apr 25 '23 at 01:03
  • No, my requirement is to match only the deepest element on the page that contains all of "example", "sample" and a number of 3 or more digits. The element `td` is the deepest element that contains all those things. Please look closely and you will see that it has all of the things I'm looking for. The word "sample" from other `td` elements must be disregarded because then it matches a higher element `tr`, which is not what I need. – Price Mitchell Apr 25 '23 at 17:33
  • Ah yes the problem is that the expression's attempt to isolate individual words (by ensuring they're surrounded by spaces) is failing, because it doesn't treat the line break that appears before `Sample` as a word delimiter. I'll update the expression to normalize white space; that'll fix it. – Conal Tuohy Apr 25 '23 at 23:06
  • Sorry, but now it finds 2 elements: both `tr` and `td`, while I only need the td. Again, you can just check in Chrome. – Price Mitchell Apr 26 '23 at 23:19
  • I remembered this time to use `normalize-space(.)` in the second clause of the expression, which I'd missed the first time. – Conal Tuohy Apr 28 '23 at 08:51
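
The whitespace problem discussed in these comments can be reproduced outside XPath. The following Python sketch emulates translate() and normalize-space() with hypothetical helpers (`xpath_translate`, `xpath_normalize_space`, `has_word`) and applies the whole-word test to two text fragments modelled on the sample HTML: one where `Sample` follows a line break (as in the target td) and one where it follows a space (as in a sibling td):

```python
import re

def xpath_translate(text, src, dst):
    """Emulate XPath 1.0 translate(): map src[i] to dst[i]; drop chars with no counterpart."""
    table = {}
    for i, ch in enumerate(src):
        if ch not in table:  # the first occurrence in src wins, as in XPath
            table[ch] = dst[i] if i < len(dst) else ""
    return "".join(table.get(ch, ch) for ch in text)

def xpath_normalize_space(text):
    """Emulate normalize-space(): collapse runs of space/tab/CR/LF and trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

def has_word(text, word, normalize):
    """The whole-word test from the expression, with or without normalize-space()."""
    if normalize:
        text = xpath_normalize_space(text)
    upper = xpath_translate(text, word.lower(), word.upper())
    padded = " " + xpath_translate(upper, ":,;.", "    ") + " "
    return " " + word.upper() + " " in padded

target_td = "New York, United States\nSample: 001068967"  # line break before "Sample"
sibling_td = "BIBHGB77\n Sample: 47605627"                # space before "Sample"

print(has_word(target_td, "sample", normalize=False))   # False
print(has_word(sibling_td, "sample", normalize=False))  # True
print(has_word(target_td, "sample", normalize=True))    # True
```

Without normalization the word is found only via the sibling td, which is why the tr (containing both cells) was the deepest match; with normalize-space() the target td matches on its own.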