
This is the HTML sample:

<div class="wpb_text_column">
    <div class="wpb_wrapper">
      <p style="text-align: center;"><a href="http://somepage1.com">First text part </a></p>
      <p style="text-align: center;"><a href="http://somepage2.com">Second text part </a></p>
      <p style="text-align: center;"><a href="http://somepage3.com">Third text part</a></p>
    </div> 
</div>
<div class="wpb_text_column">
    <div class="wpb_wrapper">
      <p style="text-align: center;"><a href="http://somepage4.com">First text part </a></p>
      <p style="text-align: center;"><a href="http://somepage5.com">Second text part</a></p>
    </div> 
</div>

With the code below

from lxml import html

tree = html.fromstring(html_sample)
tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/p/a/text()')

I can get a list of text values:

['First text part ', 'Second text part ', 'Third text part', 'First text part ', 'Second text part']

However, I want to get all the values from each div as a single string, like this:

['First text part Second text part Third text part', 'First text part Second text part']

and

//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]/normalize-space()

seems to be exactly the XPath that solves the problem, but lxml doesn't support the /normalize-space() syntax:

lxml.etree.XPathEvalError: Invalid expression

So how can I get the desired output in lxml?

Andersson
  • There seems to be an option in the lxml parser to ignore whitespace while parsing: http://stackoverflow.com/questions/3310614/remove-whitespaces-in-xml-string – SomeDude May 08 '17 at 15:10
  • Using `tree = html.fromstring(html_sample, parser=etree.XMLParser(remove_blank_text=True))` gives an error `lxml.etree.XMLSyntaxError: Extra content at the end of the document` – Andersson May 08 '17 at 15:26

1 Answer


Solved with the code below:

[" ".join(string.text_content().split()) for string in tree.xpath('//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]')]
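
For anyone who wants to run this end to end, here is a minimal self-contained sketch built from the HTML sample in the question. The names WRAPPERS, wrappers, joined and normalized are only illustrative. It also shows what is probably the closest equivalent of the expression from the question: lxml implements XPath 1.0, where a function call cannot be used as a path step, but normalize-space(.) can be evaluated relative to each matched div.

```python
from lxml import html

# HTML sample copied from the question.
html_sample = """
<div class="wpb_text_column">
    <div class="wpb_wrapper">
      <p style="text-align: center;"><a href="http://somepage1.com">First text part </a></p>
      <p style="text-align: center;"><a href="http://somepage2.com">Second text part </a></p>
      <p style="text-align: center;"><a href="http://somepage3.com">Third text part</a></p>
    </div>
</div>
<div class="wpb_text_column">
    <div class="wpb_wrapper">
      <p style="text-align: center;"><a href="http://somepage4.com">First text part </a></p>
      <p style="text-align: center;"><a href="http://somepage5.com">Second text part</a></p>
    </div>
</div>
"""

# Illustrative constant: the wrapper XPath used in the question and the answer.
WRAPPERS = '//div[@class="wpb_text_column"]/div[@class="wpb_wrapper"]'

tree = html.fromstring(html_sample)
wrappers = tree.xpath(WRAPPERS)

# Approach from the answer: collect all descendant text of each wrapper,
# then collapse every run of whitespace into a single space.
joined = [" ".join(div.text_content().split()) for div in wrappers]

# Alternative: evaluate the XPath 1.0 function normalize-space() with each
# wrapper as the context node instead of appending it as a path step.
normalized = [div.xpath("normalize-space(.)") for div in wrappers]

print(joined)
print(normalized)
```

Both lists should come out the same for this sample. The text_content() version is plain Python and easy to adapt, for example to use a different separator, while the normalize-space(.) version keeps the whitespace handling inside XPath.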
Andersson