How do I preserve new lines when extracting text from html using lxml.text_content()

Question

I am trying to learn to use Whoosh. I have a large collection of HTML documents I want to search. I discovered that the text_content() method creates some interesting problems. For example, I might have some text organized in a table that looks like this:

<html><table><tr><td>banana</td><td>republic</td></tr><tr><td>stateless</td><td>person</td></table></html>

When I take the original string, parse it into a tree, and then use text_content() to get the text in the following manner

from lxml import html
mytree = html.fromstring(myString)
text = mytree.text_content()

The results have no spaces (as should be expected)

'bananarepublicstatelessperson'

I tried to insert new lines using string.replace()

myString = myString.replace('</tr>','</tr>\n')

I confirmed that the new line was present

'<html><table><tr><td>banana</td><td>republic</td></tr>\n<tr><td>stateless</td><td>person</td></table></html>'

but when I run the same code as above, the line feeds are not present, so the resulting text_content() looks just like the output above. This is a problem for me because I need to be able to separate words. I thought I could add non-breaking spaces after each td, and line breaks after rows as well as after body elements etc., to get text that reasonably conforms to my original source.

I will note that I did some more testing and found that line breaks inserted after closing paragraph tags were preserved. But there is a lot of text in the tables that I need to be able to search.

Thanks for any assistance

score -1 · Answer 1 · edited May 23 '17 at 12:09

You could use this solution:

import re
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

>>> striphtml('<a href="foo.com" class="bar">I Want This <b>text!</b></a>')
'I Want This text!'

Found here: using python, Remove HTML tags/formatting from a string
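
One caveat, given as a hedged sketch rather than as part of the original answer: applied to the table from the question, deleting every tag still runs the words together, because nothing replaces the cell and row boundaries. A possible variant turns closing cell tags into spaces and closing row tags into newlines before stripping the rest. The names TAG, CELL_END, ROW_END and striphtml_keep_breaks are made up for this illustration, and the usual warning about regular expressions on real-world HTML still applies.

```python
import re

# Hypothetical helper patterns, not part of the original answer.
TAG = re.compile(r'<.*?>')
CELL_END = re.compile(r'</t[dh]\s*>', re.IGNORECASE)
ROW_END = re.compile(r'</tr\s*>', re.IGNORECASE)

def striphtml_keep_breaks(data):
    """Strip tags, but keep a space per table cell and a newline per row."""
    data = CELL_END.sub(' ', data)   # closing </td> or </th> becomes a space
    data = ROW_END.sub('\n', data)   # closing </tr> becomes a newline
    return TAG.sub('', data)         # remove every remaining tag

table = ('<html><table><tr><td>banana</td><td>republic</td></tr>'
         '<tr><td>stateless</td><td>person</td></table></html>')

print(repr(TAG.sub('', table)))            # plain stripping: words run together
print(repr(striphtml_keep_breaks(table)))  # cells and rows stay separated
```

The result may carry trailing spaces on each line; calling strip() per line removes them if they matter for indexing.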

answered Oct 26 '14 at 16:05

William Fernandes

Thanks for this, but I have always read that one should avoid regular expressions for handling HTML. I was aware of this possibility but have avoided it, and would probably resort to a manual approach using lxml before I did this. That is, I would process each element sequentially as it is found and apply rules. – PyNEwbie Oct 26 '14 at 16:17
Your answer led me to html2text, which I think is a better start, so I marked it up. – PyNEwbie Oct 26 '14 at 17:15
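
The manual lxml approach mentioned in the first comment could look roughly like the sketch below. It assumes that a space after each cell and a newline after each row are the only rules needed, and the function name text_with_breaks is hypothetical. It relies on text_content() including the tail text of descendant elements, so separators written into the .tail of td/th and tr elements after parsing survive extraction.

```python
from lxml import html

def text_with_breaks(source):
    """Extract text, keeping a space per table cell and a newline per row.

    A sketch only: other block-level elements would need their own rules.
    """
    tree = html.fromstring(source)
    # Prepend the separator to any existing tail so no text is lost.
    for cell in tree.iter('td', 'th'):
        cell.tail = ' ' + (cell.tail or '')
    for row in tree.iter('tr'):
        row.tail = '\n' + (row.tail or '')
    return tree.text_content()

my_string = ('<html><table><tr><td>banana</td><td>republic</td></tr>'
             '<tr><td>stateless</td><td>person</td></table></html>')
print(repr(text_with_breaks(my_string)))
```

On the table from the question this should give one line per row with the cells separated by spaces, plus some trailing whitespace that can be stripped before indexing.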
