I have an issue using regex in an lxml etree.XPath expression in Python 3.6.
In this example I'm searching the Stack Overflow homepage for a 4-digit number surrounded by whitespace, and returning the XPath of the element that contains it.
I'm getting matches that are just whitespace, and I can't seem to filter them out. My feeling is that it may be an encoding issue, but I can't put my finger on it...
The picture below is from https://regex101.com/, which correctly gives me 1 match.
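The same check can be sketched in plain Python with the standard `re` module. The `sample` string here is a hypothetical stand-in modelled on the page footer, not the real page source:

```python
import re

# Hypothetical sample text, modelled on the footer of the page.
sample = "site design / logo © 2017 Stack Exchange Inc; user contributions licensed under"

# Same pattern as in the XPath expression: whitespace, four digits, whitespace.
pattern = re.compile(r"(\s\d{4}\s)")

# With a single capturing group, findall returns the text captured by that group.
matches = pattern.findall(sample)
print(matches)
```

On this sample the pattern should find a single match, the year with a space on either side.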
Here is a link to the homepage HTML: https://drive.google.com/open?id=0B3HIB_5rVAxmZU9ialZHdzhscE0
Here's my code:
from lxml import html
from lxml import etree

with open('stackoverflow.html', 'r', encoding='utf8') as f:
    page_html = f.read()

html_tree = html.fromstring(page_html)
regexpNS = "http://exslt.org/regular-expressions"
find = etree.XPath("//*[re:test(., '(\s\d{4}\s)', 'i')]",
                   namespaces={'re':regexpNS})
tree = etree.fromstring(page_html)
tree = etree.ElementTree(tree)
for element in find(tree):
    text = str(element.text)
    str(text).strip()
    if text != '':
        print(text)
        print(len(text))
        print(tree.getpath(element))
        print('##############################################################')
Outputs
None
4
/ *
##############################################################
13
/ * / *[2]
##############################################################
13
/ * / *[2] / * [8]
##############################################################
17
/ * / *[2] / * [8] / *
##############################################################
21
/ * / *[2] / * [8] / * / *
##############################################################
25
/ * / *[2] / * [8] / * / * / * [18]
##############################################################
29
/ * / *[2] / * [8] / * / * / * [18] / *
##############################################################
33
/ * / *[2] / * [8] / * / * / * [18] / * / * [2]
##############################################################
site
design / logo © 2017
Stack
Exchange
Inc;
user
contributions
licensed
under
117
/ * / *[2] / * [8] / * / * / * [18] / * / * [2] / *
##############################################################
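A smaller, self-contained version that should show the same pattern of nested paths without needing the downloaded file. The `SAMPLE` document is a hypothetical stand-in, not the real Stack Overflow HTML. It relies on the XPath rule that `.` passed to a string function is the element's string-value, meaning the concatenated text of all its descendants, so every ancestor of the element holding the year can match as well:

```python
from lxml import etree

REGEXP_NS = "http://exslt.org/regular-expressions"

# Hypothetical stand-in for the downloaded page; not the real HTML.
SAMPLE = (
    "<html><body>\n"
    "  <div>\n"
    "    <p>no digits here</p>\n"
    "    <span>logo © 2017 Stack Exchange Inc</span>\n"
    "  </div>\n"
    "</body></html>"
)

# Same test as in the question: '.' is the element's full string-value.
find = etree.XPath(r"//*[re:test(., '\s\d{4}\s', 'i')]",
                   namespaces={"re": REGEXP_NS})

root = etree.fromstring(SAMPLE)
tree = etree.ElementTree(root)

paths = []
for element in find(tree):
    path = tree.getpath(element)
    paths.append(path)
    # element.text is only the text before the first child, so it can be
    # None or pure whitespace even though the element matched.
    print(path, repr(element.text))
```

If this behaves the way I expect, the wrapper elements are matched along with the `span`, and their own `.text` is either `None` or just the indentation whitespace.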
What's up with the blank text lines with len > 0 that should have been stripped?
Thanks!
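One thing worth checking in the loop above, sketched here with made-up values rather than the real page: `str.strip()` returns a new string instead of changing `text` in place, and `str(None)` is the four-character string `'None'`. The `clean` helper and the `samples` list are hypothetical names introduced only for this illustration:

```python
# Hypothetical values standing in for element.text; not taken from the real page.
samples = [None, "\n            ", "  © 2017 Stack Exchange Inc  "]


def clean(raw):
    """Return the stripped text, treating a missing text node as empty."""
    return (raw or "").strip()


for raw in samples:
    as_in_loop = str(raw)
    as_in_loop.strip()  # return value discarded, so as_in_loop is unchanged
    print(repr(as_in_loop), len(as_in_loop), "->", repr(clean(raw)))

# Keeping only the entries that still have content after stripping.
kept = [clean(raw) for raw in samples if clean(raw)]
print(kept)
```

Assigning the result of `strip()` back to a variable, and handling `None` before calling `str()`, might be enough to filter out the blank entries.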