3

I am trying to find all div tags with id begins with "post-{here a lot of digits}" I tried something like this:

tree.xpath("//div[starts-with(@id,'post-[0-9]')]")

But does not really work. Is there a way to do this without importing regular expressions in python?

Geveze
  • 393
  • 3
  • 12
  • 1
    *does not really work* what do you mean ? – Raptor Jun 05 '13 at 08:20
  • whats the [0-9] for if you only care about the "post-" ? And off course what @Shivan Raptor said. – pypat Jun 05 '13 at 08:22
  • well, i don't get the desired output :). The [0-9] is not recognised by "starts-with" method. – Geveze Jun 05 '13 at 08:22
  • actually what i want is all divs with id='post-', like for example id='post-123456' – Geveze Jun 05 '13 at 08:23
  • 1
    Xpath 2 has a function called matches() it takes a regex. You could see if lexxml supports that and then do this: matches(@id, 'post-\d') or a like. – pypat Jun 05 '13 at 08:29
  • 2
    See [lxml docs: regular expressions in XPath](http://lxml.de/xpathxslt.html#regular-expressions-in-xpath) – Janne Karila Jun 05 '13 at 08:36

3 Answers3

2

XPath 1.0 does not support regular expressions, i.e. the function starts-with does not support regular expressions.

Lxml does not support XPath 2.0. You have the following three options:

  • Switch to a processor who is able to handle XPath 2.0. You can then use the fn:matches() function.

  • Use a XPath 1.0 compliant solution. This is rather ugly, but it works and may in some circumstances be the easiest solution. However, this is not a general solution! It will replace the numbers in @id with a - and match against this. So this would also deliver true if the original id was something like post--. Use a character which you know will not occur at this position.

tree.xpath("//div[starts-with(translate(@id, '0123456789', '----------'), 'post--')]")
  • lxml supports the EXSLT namespaces and you can use the regex functions from there. In my opinion this is the best solution.
regexpNS = "http://exslt.org/regular-expressions"
r = tree.xpath("//div[re:test(@id, '^post-[0-9]')]", namespaces={'re': regexpNS})
dirkk
  • 6,160
  • 5
  • 33
  • 51
  • The funny thing is in your last section that the site does not exist, but it works :) – Geveze Jun 05 '13 at 09:53
  • @Geveze This is not surprising at all. An XML namespace is an URI, but not an URL. Although they look the same in this case, this is just by convention to minimize the probability of a duplicate identifier and to make it human-readable. It could also be a simple string like `reg-ex`. Think of an XML namespaces as a string which is a unique identifier and has nothing to do with an URL. – dirkk Jun 05 '13 at 10:14
  • thnx for the explanation – Geveze Jun 05 '13 at 11:03
0

If you just want to check @id which may start with 'post-', xpath //div[starts-with(@id,'post-')] is enough. But if you are looking for @id which must be a combination of 'post-$AnyDigit then you must use matches() function.

Navin Rawat
  • 3,208
  • 1
  • 19
  • 31
0

The xpath-1.0 solution for issues like this would be to use translate().

E.g: translate( @id, '0123456789' , '0' ) will change any number to 0 ('123' -> '000'.

Therefor if you like to find for example "post-" followed by three digits use something like:

"//div[starts-with(translate( @id, '0123456789' , '0' ), 'post-000')]"
hr_117
  • 9,589
  • 1
  • 18
  • 23