I am trying to learn to use Whoosh. I have a large collection of html documents I want to search. I discovered that the text_content() method creates some interesting problems for example I might have some text that is organized in a table that looks like
<html><table><tr><td>banana</td><td>republic</td></tr><tr><td>stateless</td><td>person</td></table></html>
When I take the original string and and get the tree and then use text_content to get the text in the following manner
mytree = html.fromstring(myString)
text = mytree.text_content()
The results have no spaces (as should be expected)
'bananarepublicstatelessperson'
I tried to insert new lines using string.replace()
myString = myString.replace('</tr>','</tr>\n')
I confirmed that the new line was present
'<html><table><tr><td>banana</td><td>republic</td></tr>\n<tr><td>stateless</td><td>person</td></table></html>'
but when I run the same code from above the line feeds are not present. Thus the resulting text_content() looks just like above. This is a problem from me because I need to be able to separate words, I thought I could add non-breaking spaces after each td and line breaks after rows as well asd line breaks after body elements etc to get text that reasonably conforms to my original source.
I will note that I did some more testing and found that line breaks inserted after paragraph tag closes were preserved. But there is a lot of text in the tables that I need to be able to search.
Thanks for any assistance