This question regards XPath expressions.

I want to find the average length of all URLs in a Web page that point to a .pdf file.

So far I have constructed the following expression, but it does not work:

sum(string-length(string(//a/@href[contains(., ".pdf")]))) div count(//a/@href[contains(., ".pdf")])

Any help will be appreciated!

  • Which XPath version are you looking for, 1.0 or 2.0? – Navin Rawat May 17 '13 at 07:22
  • The XPath version is not an issue: a solution in either version is fine. At the moment I am testing the expression using FirePath (in Firefox). – Daniel Amariei May 17 '13 at 07:29
  • What does _XPath version is not an issue_ mean? I do not think there is a solution for this with version 1.0. Also keep in mind that FirePath does not even support the complete version 1.0 (or at least has some issues), if I remember correctly. – hr_117 May 17 '13 at 08:06
  • Post some input. Does your document include namespaces? – Jens Erat May 17 '13 at 08:31
  • @JensErat You can assume that the document does not contain any namespaces. It should work on any valid HTML document (the structure is known in HTML -- **a** tags contain **href** attributes). I do not think the rest matters since the traversal can be made recursive. – Daniel Amariei May 17 '13 at 09:21
  • Then you should be totally fine with the XPath 2.0 expression given in my answer. And "I've got some XML document" does not imply not having namespaces (XHTML), so this is an important question. – Jens Erat May 17 '13 at 09:24
  • @hr_117 I mean that I am not constrained to produce the expression in one specific version. I just need it to work – Daniel Amariei May 18 '13 at 16:36
  • I did not think it mattered so much -- my bad. Thank you for your help! – Daniel Amariei May 18 '13 at 16:37

2 Answers

You will need XPath 2.0.

For calculating the sum of the string lengths, you would need to either

  • build a concatenated string of all @hrefs to pass to string-length($string as xs:string) (which only accepts a single string as its parameter), but concat(...) only takes an arbitrary number of atomic strings, not a sequence of them; or
  • apply string-length(...) to every @href as @Navin Rawat proposed, but using arbitrary functions in axis steps is a new feature of XPath 2.0.

If using XPath 2.0, there are functions avg(...) and ends-with(...) which help you in stripping down the expression to

avg(//a/@href[ends-with(., '.pdf')]/string-length())

If you have to stick with XPath 1.0, all you can do is use the expression below to fetch the URLs and then calculate the average outside XPath.


Anyway, the subexpression you proposed will fail at URLs like http://example.net/myfile.pdf.txt. Only compare the end of the URL:

//a[@href[substring(., string-length(.) - 3) = '.pdf']]/@href
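To see why `string-length(.) - 3` is the right start position, here is a small Python sketch that mimics the 1-based `substring(s, start)` of XPath 1.0 for integer start values. The helper names `xpath_substring` and `ends_with_pdf` are made up for this illustration; they are not part of any library.

```python
def xpath_substring(s: str, start: int) -> str:
    """Mimic XPath 1.0 substring(s, start) for integer start values.

    XPath counts characters from 1; a start position below 1
    simply yields the whole string.
    """
    return s[max(start, 1) - 1:]


def ends_with_pdf(url: str) -> bool:
    # Same test as: substring(., string-length(.) - 3) = '.pdf'
    # Starting at (length - 3) leaves exactly the last four characters.
    return xpath_substring(url, len(url) - 3) == ".pdf"
```

With this check, `http://example.net/myfile.pdf` matches while `http://example.net/myfile.pdf.txt` does not, which is the case where `contains(., '.pdf')` goes wrong.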

And you missed a path step for the attribute, so you've been trying to average the string length of the link names right now.
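If you end up computing the average outside XPath, a minimal Python sketch could look like the following. It assumes the page is well-formed XML without namespaces (real-world HTML usually needs a forgiving parser instead), and the function name `average_pdf_href_length` and the sample markup are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def average_pdf_href_length(markup: str) -> float:
    """Average length of all <a href> values that end in '.pdf'.

    Assumes well-formed, namespace-free markup. Returns 0.0 when
    no matching link exists, to avoid dividing by zero.
    """
    root = ET.fromstring(markup)
    # Collect the href attribute of every <a> element that has one.
    hrefs = [a.get("href") for a in root.iter("a") if a.get("href") is not None]
    # Compare only the end of the URL, as in the XPath 1.0 expression.
    pdf_hrefs = [h for h in hrefs if h.endswith(".pdf")]
    if not pdf_hrefs:
        return 0.0
    return sum(len(h) for h in pdf_hrefs) / len(pdf_hrefs)


SAMPLE = """<html><body>
<a href="a.pdf">A</a>
<a href="docs/bb.pdf">B</a>
<a href="myfile.pdf.txt">C</a>
<a name="x">D</a>
</body></html>"""

print(average_pdf_href_length(SAMPLE))  # (5 + 11) / 2 = 8.0
```

The third link is skipped because it only contains '.pdf', and the fourth has no href at all.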

Jens Erat

Please put something like:

sum(//a/@href[contains(.,'.pdf')]/string-length()) div count(//a/@href[contains(.,'.pdf')])
Navin Rawat
  • This is XPath 2.0, sums up the string length of the link texts instead of the URLs and fails for non-PDF-files _containing_ '.pdf' somewhere in their name. – Jens Erat May 17 '13 at 08:31
  • I will try the expression again when I find a proper XPath 2.0 tool to use. Thank you all! – Daniel Amariei May 17 '13 at 09:41