I've written a Python script to scrape some text out of a few HTML elements. The script can parse them now, but the results look odd, with a bunch of whitespace between the pieces. How can I fix this? Any help will be highly appreciated.
These are the HTML elements the text should be scraped from:
html="""
<div class="postal-address">
<p>11525 23 AVE</p>
<p>EDMONTON,
AB
,
T6J 4T3
</p>
<p><a rel="nofollow" href="mailto:info@something.com">info@something.com</a></p>
<p><a rel="nofollow" href="http://www.something.org" target="_blank">Visit our Web Site</a></p>
</div>
"""
This is the script I'm trying:
from lxml.html import fromstring
root = fromstring(html)
address = [item.text for item in root.cssselect(".postal-address p")]
print(address)
The result I'm getting:
11525 23 AVE, EDMONTON,\n AB\n ,\n T6J 4T3\n
Expected result:
11525 23 AVE EDMONTON, AB, T6J 4T3
I tried applying .strip() and .replace("\n","") inside the list comprehension [item.text for item in root.cssselect(".postal-address p")], but it threw an error about a NoneType object.
By the way, I don't want a regex-based solution. Thanks in advance.
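For what it's worth, here is a minimal regex-free sketch of the cleanup step I have in mind. It assumes the error comes from item.text being None for the two <p> elements whose only content is an <a> child, since lxml's .text holds only the text before the first child element. The helper name clean and the raw list are hypothetical; raw stands in for the values the list comprehension would produce, so the snippet runs without lxml.

```python
def clean(text):
    """Collapse whitespace around commas and words; treat None as empty."""
    if text is None:  # assumed cause of the error: no text before the first child
        return ""
    # Split on commas, squeeze the whitespace inside each piece, drop empty pieces.
    pieces = [" ".join(piece.split()) for piece in text.split(",")]
    return ", ".join(piece for piece in pieces if piece)


# Hypothetical stand-in for:
#   [item.text for item in root.cssselect(".postal-address p")]
raw = ["11525 23 AVE", "EDMONTON,\nAB\n,\nT6J 4T3\n", None, None]

# Clean every entry, skip the empty ones, and join the rest with single spaces.
address = " ".join(part for part in map(clean, raw) if part)
print(address)
```

With the real script, the same helper could presumably be applied as clean(item.text) inside the list comprehension, which guards against None before any string method is called.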