Python lxml html xpath regex parsing

Question

I have an issue using a regex in an lxml etree.XPath expression in Python 3.6.

In this example I'm searching the Stack Overflow homepage for a 4-digit number surrounded by whitespace, and returning the XPath of the matching element.

I'm getting matches that are just whitespace, and I can't seem to filter them out. My feeling is that it may be an encoding issue, but I can't put my finger on it...

The picture below is from https://regex101.com/, which correctly gives me 1 match.

Here is a link to the homepage HTML: https://drive.google.com/open?id=0B3HIB_5rVAxmZU9ialZHdzhscE0

Here's my code

from lxml import html
from lxml import etree

with open('stackoverflow.html', 'r', encoding='utf8') as f:
    page_html = f.read()

html_tree = html.fromstring(page_html)

regexpNS = "http://exslt.org/regular-expressions"
find = etree.XPath("//*[re:test(., '(\s\d{4}\s)', 'i')]",
                       namespaces={'re':regexpNS})

tree = etree.fromstring(page_html)
tree = etree.ElementTree(tree)
for element in find(tree):
    text = str(element.text)
    str(text).strip()
    if text != '':
        print(text)
        print(len(text))
        print(tree.getpath(element))
        print('##############################################################')

Outputs

    None
    4
    / *
    ##############################################################

    13
    / * / *[2]
    ##############################################################

    13
    / * / *[2] / * [8]
    ##############################################################

    17
    / * / *[2] / * [8] / *
    ##############################################################

    21
    / * / *[2] / * [8] / * / *
    ##############################################################

    25
    / * / *[2] / * [8] / * / * / * [18]
    ##############################################################

    29
    / * / *[2] / * [8] / * / * / * [18] / *
    ##############################################################

    33
    / * / *[2] / * [8] / * / * / * [18] / * / * [2]
    ##############################################################
    site
    design / logo © 2017
    Stack
    Exchange
    Inc;
    user
    contributions
    licensed
    under
    117
    / * / *[2] / * [8] / * / * / * [18] / * / * [2] / *
    ##############################################################

What's up with the blank text lines with len > 0 that should have been stripped?

Thanks!

Hmm... no one explained why I was getting multiple results for the XPath regex expression... A workaround was good enough. The root question is different from the one it was marked a duplicate of. I suppose it doesn't matter anyway. — James Schinner, Jul 01 '17 at 14:00

falsetru · Accepted Answer · 2017-07-01T13:16:43.287

str.strip returns a stripped copy of the string; it does not change text in place.

>>> text = '    a    '
>>> text.strip()   # returns a new string
'a'
>>> text  # `text` is not changed
'    a    '

If you want to change text, you need to assign the return value of the above expression back to text (BTW, you don't need to call str(..) because text is already a str object):

str(text).strip()

should be replaced with:

text = text.strip()
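
A minimal sketch of the corrected filtering step, assuming the goal is simply to skip blank and missing text. The helper name `clean_text` and the sample strings are made up for illustration; in the question's loop the input would be `element.text`, which can be `None` when an element has no text before its first child.

```python
def clean_text(raw):
    """Return the stripped text, or None when there is nothing worth printing."""
    if raw is None:
        # Check for None before any str(...) call, otherwise it becomes 'None'.
        return None
    stripped = raw.strip()   # strip() returns a new string, so keep the result
    return stripped or None  # an empty string after stripping counts as blank


# Hypothetical sample values standing in for element.text
samples = [None, "\n        ", "  site design / logo © 2017  "]
cleaned = [clean_text(s) for s in samples]

for text in cleaned:
    if text is not None:
        print(text, len(text))
```

Returning `None` for both cases keeps the loop body down to a single check.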

edited Jul 01 '17 at 13:16

answered Jul 01 '17 at 13:10

Arhh, a string is immutable? That's why? And yeah, I saw that, but I was trying to jam it into a type because it was being odd... – James Schinner Jul 01 '17 at 13:17
@JamesSchinner, Yes, strings are immutable in Python. – falsetru Jul 01 '17 at 13:19
Yes, that solved my problem. I'm now able to filter out the blanks. So thanks! Though I still don't know why I'm getting blank matches and even a `None` type. – James Schinner Jul 01 '17 at 13:23
@JamesSchinner, Because it's converted to the string `'None'` (because of `str(element.text)`). You can filter that out just below the `for ...`: `if element.text is None: continue` – falsetru Jul 01 '17 at 13:27
The None isn't a `None` it's a "None" - solved. Cheers mate! – James Schinner Jul 01 '17 at 13:27
@downvoter, Please let me know how to improve the answer. – falsetru Jul 01 '17 at 13:28
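
On the open question from the comments (why blank elements match at all): in XPath, the `.` passed to `re:test(...)` is converted to the element's string value, which is the concatenation of all its descendant text nodes. So every ancestor of the element containing the 4-digit number matches too, while `element.text` in lxml only holds the text before the first child, which for those ancestors is usually whitespace or `None`. Below is a minimal sketch, assuming a small stand-in document rather than the real page, that compares the original predicate with one that tests each element's own text nodes.

```python
from lxml import etree

REGEXP_NS = "http://exslt.org/regular-expressions"

# Small stand-in document (hypothetical, not the real Stack Overflow page).
doc = etree.fromstring(
    "<html><body>\n"
    "  <div>\n"
    "    <p>site design / logo © 2017 Stack Exchange Inc</p>\n"
    "  </div>\n"
    "</body></html>"
)
tree = etree.ElementTree(doc)

# '.' is the element's string value (all descendant text concatenated),
# so the ancestors of <p> match as well as <p> itself.
by_string_value = etree.XPath(
    r"//*[re:test(., '\s\d{4}\s')]", namespaces={"re": REGEXP_NS}
)

# Testing each text node child instead keeps only the element that
# directly contains the matching text.
by_own_text = etree.XPath(
    r"//*[text()[re:test(., '\s\d{4}\s')]]", namespaces={"re": REGEXP_NS}
)

all_paths = [tree.getpath(e) for e in by_string_value(tree)]
own_paths = [tree.getpath(e) for e in by_own_text(tree)]
print(all_paths)
print(own_paths)
```

The raw string (`r"..."`) keeps Python from treating `\s` and `\d` as string escapes before the pattern reaches the regex engine.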
