1

I've written a Python script to scrape some text out of some HTML elements. The script parses them now, but the results look odd, with a bunch of whitespace between the pieces. How can I fix it? Any help will be highly appreciated.

These are the HTML elements the text should be scraped from:

html="""
<div class="postal-address">
        <p>11525 23 AVE</p>


        <p>EDMONTON,
        AB
        ,
        T6J 4T3
        </p>

        <p><a rel="nofollow" href="mailto:info@something.com">info@something.com</a></p>
        <p><a rel="nofollow" href="http://www.something.org" target="_blank">Visit our Web Site</a></p>
    </div>
"""

This is the script I'm using:

from lxml.html import fromstring

root = fromstring(html)
address = [item.text for item in root.cssselect(".postal-address p")]
print(address)

Result I'm having:

11525 23 AVE, EDMONTON,\n        AB\n        ,\n        T6J 4T3\n

Expected result:

11525 23 AVE EDMONTON, AB, T6J 4T3

I tried applying .strip() and .replace("\n","") in this line, [item.text for item in root.cssselect(".postal-address p")], but it threw an error about a NoneType object.

By the way, I'd prefer a solution that doesn't use regex. Thanks in advance.

SIM

3 Answers

1

Try the solution below and let me know if you run into any issues:

address = [" ".join(item.text.split()).replace(" ,", ",") for item in root.cssselect(".postal-address p") if item.text]

Output:

['11525 23 AVE', 'EDMONTON, AB, T6J 4T3']
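To get the single-line result from the question, the same idea can be wrapped in a small helper and the cleaned parts joined. This is only a sketch: `normalize` and `raw_texts` are names made up for illustration, and `raw_texts` is a hard-coded stand-in for what `[item.text for item in root.cssselect(".postal-address p")]` would give on the sample HTML, assuming `item.text` is `None` for the paragraphs that contain only a link.

```python
from typing import List, Optional


def normalize(text: Optional[str]) -> str:
    """Collapse whitespace runs and drop the space left before commas."""
    if not text:  # covers None and empty strings
        return ""
    return " ".join(text.split()).replace(" ,", ",")


# Hypothetical stand-in for the item.text values of the four <p> elements.
# The last two paragraphs hold only an <a> child, so their text is None.
raw_texts: List[Optional[str]] = [
    "11525 23 AVE",
    "EDMONTON,\n        AB\n        ,\n        T6J 4T3\n        ",
    None,
    None,
]

# Skip the None entries, clean the rest, then join into one line.
parts = [normalize(t) for t in raw_texts if t]
address = " ".join(parts)
print(address)  # 11525 23 AVE EDMONTON, AB, T6J 4T3
```

The `if t` filter is what avoids the NoneType error mentioned in the question: calling `.strip()` or `.split()` on `None` fails, so those entries have to be skipped before any string method is applied.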
Andersson
0

When you do .replace("\n","") I think you have to escape the backslash. This can be confusing sometimes, and without trying it I can't tell you how many backslashes you need to escape it, but try one of these:

.replace("\\n","")
.replace("\\\n","")
.replace("\\\\n","")

What happens when you use single quotes?
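For what it's worth, a quick way to check what each of those literals actually contains is to print its `repr` and length. This is just a sketch with a made-up sample string (`sample`), not the asker's real data:

```python
# Each candidate search string, written as a normal (non-raw) Python literal.
candidates = ["\n", "\\n", "\\\n", "\\\\n"]

for s in candidates:
    # repr() shows the characters actually stored in the string.
    print(repr(s), "length:", len(s))

# Hypothetical sample resembling the scraped text.
sample = "AB\n        ,"

# "\n" is a real newline character, so this removes the line break.
no_newline = sample.replace("\n", "")

# "\\n" is a backslash followed by the letter n, which the sample lacks.
unchanged = sample.replace("\\n", "")

print(repr(no_newline))
print(unchanged == sample)
```

Single and double quotes treat backslash escapes the same way in Python, so switching quote style should not change the result.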

Brént Russęll
0
  1. Split the source string on commas.
  2. Strip off any leading or trailing whitespace from each string in the resulting list.
  3. Join the strings using ', ' as the separator.

Like this:

src = '11525 23 AVE, EDMONTON,\n        AB\n        ,\n        T6J 4T3\n'
print(', '.join([s.strip() for s in src.split(',')]))

output

11525 23 AVE, EDMONTON, AB, T6J 4T3

If you already have a list of strings, this is even easier:

address = [
    '11525 23 AVE', 
    ' EDMONTON', 
    '\n        AB\n        ', 
    '\n        T6J 4T3\n'
]

print(', '.join([s.strip() for s in address]))
PM 2Ring
  • Thanks PM 2Ring, for your answer. It seems to be working but how should I apply the same in this line `[item.text for item in root.cssselect(".postal-address p")]` which is the main concern here. – SIM Oct 18 '17 at 11:25
  • @Topto Sorry, I thought you just needed to convert a single string, I didn't notice that you already have a list of strings, since in your "Result I'm having:" section there aren't any brackets getting printed. If you already have a list of strings then you don't need to do the `.split` step. I'll add some more code to my answer shortly. – PM 2Ring Oct 18 '17 at 11:32