
I have an issue using a regex in an lxml etree.XPath expression in Python 3.6.

In this example I'm searching the Stack Overflow homepage for a 4-digit number surrounded by whitespace, and returning the XPath of the element that contains it.

I'm getting matches whose text is just whitespace, and I can't seem to filter them out. My feeling is that it may be an encoding issue, but I can't put my finger on it...

Testing the same pattern at https://regex101.com/ correctly gives me 1 match.

Here is a link to the homepage HTML: https://drive.google.com/open?id=0B3HIB_5rVAxmZU9ialZHdzhscE0

Here's my code

from lxml import html
from lxml import etree

with open('stackoverflow.html', 'r', encoding='utf8') as f:
    page_html = f.read()

html_tree = html.fromstring(page_html)

regexpNS = "http://exslt.org/regular-expressions"
find = etree.XPath("//*[re:test(., '(\s\d{4}\s)', 'i')]",
                       namespaces={'re':regexpNS})

tree = etree.fromstring(page_html)
tree = etree.ElementTree(tree)
for element in find(tree):
    text = str(element.text)
    str(text).strip()
    if text != '':
        print(text)
        print(len(text))
        print(tree.getpath(element))
        print('##############################################################')

Outputs

    None
    4
    / *
    ##############################################################

    13
    / * / *[2]
    ##############################################################

    13
    / * / *[2] / * [8]
    ##############################################################

    17
    / * / *[2] / * [8] / *
    ##############################################################

    21
    / * / *[2] / * [8] / * / *
    ##############################################################

    25
    / * / *[2] / * [8] / * / * / * [18]
    ##############################################################

    29
    / * / *[2] / * [8] / * / * / * [18] / *
    ##############################################################

    33
    / * / *[2] / * [8] / * / * / * [18] / * / * [2]
    ##############################################################
    site
    design / logo © 2017
    Stack
    Exchange
    Inc;
    user
    contributions
    licensed
    under
    117
    / * / *[2] / * [8] / * / * / * [18] / * / * [2] / *
    ##############################################################

What's up with the blank text lines with len > 0 that should have been stripped?

Thanks!

James Schinner
  • Hmm... no one explained why I was getting multiple results for the xpath regex expression... A work around was good enough. The root question is different to the reason it is a duplicate. I suppose it doesn't matter anyway. – James Schinner Jul 01 '17 at 14:00

1 Answer


str.strip returns a new, stripped string; it does not modify text in place, because Python strings are immutable.

>>> text = '    a    '
>>> text.strip()   # returns a new string
'a'
>>> text  # `text` is not changed
'    a    '

If you want to change text, you need to assign the return value of the above expression back to text. (BTW, you don't need to call str(..) because text is already a str object.) So this line:

str(text).strip()

should be replaced with:

text = text.strip()
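
Below is a minimal sketch of how this plays out in the loop. It uses only the standard library rather than lxml, and the tiny document, PATTERN, string_value and own_text are hypothetical names made up for illustration. The assumption is that re:test(., ...) is evaluated against the XPath string value of each element, which is the concatenation of all descendant text. If that holds, every ancestor of the footer element matches too, and those ancestors have an element.text that is only indentation whitespace (or None, which str() turns into the 4-character string 'None').

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of the page: the footer text sits three levels deep.
DOC = """<html>
  <body>
    <div>
      <p>site design / logo 2017 Stack Exchange Inc</p>
    </div>
  </body>
</html>"""

PATTERN = re.compile(r"\s\d{4}\s")


def string_value(element):
    """Approximate the XPath string value: all descendant text joined."""
    return "".join(element.itertext())


def own_text(element):
    """Return the element's direct text, stripped; '' when it is None."""
    text = element.text or ""  # avoids str(None) == 'None'
    text = text.strip()        # reassign: strip() returns a new string
    return text


root = ET.fromstring(DOC)

# Testing against the string value matches the <p> and all of its ancestors.
matched = [el.tag for el in root.iter() if PATTERN.search(string_value(el))]

# After a real strip, only the element whose own text matches is kept.
kept = [el.tag for el in root.iter() if PATTERN.search(own_text(el))]

print(matched)
print(kept)
```

In lxml itself, restricting the test to an element's own text nodes, for example //*[text()[re:test(., '\s\d{4}\s')]], should have a similar effect, though I have not run that against this page.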
falsetru