I am trying to get some text from an element, using pyquery 1.2. There are no spaces in the displayed text, but pyquery is inserting spaces.

Here is my code:

from pyquery import PyQuery as pq
html = '<h1><span class="highlight" style="background-color:">Randomized</span> and <span class="highlight" style="background-color:">non-randomized</span> <span class="highlight" style="background-color:">patients</span> in <span class="highlight" style="background-color:">clinical</span> <span class="highlight" style="background-color:">trials</span>: <span class="highlight" style="background-color:">experiences</span> with <span class="highlight" style="background-color:">comprehensive</span> <span class="highlight" style="background-color:">cohort</span> <span class="highlight" style="background-color:">studies</span>.</h1>'
doc = pq(html)
print(doc('h1').text())

This produces (note spaces before colon and period):

Randomized and non-randomized patients in clinical trials : experiences with comprehensive cohort studies .

How can I stop pyquery inserting spaces into the text?

Richard

1 Answer

After reading PyQuery's source I found that the text() method returns the following:

return ' '.join([t.strip() for t in text if t.strip()])

This means that each non-empty piece of text is stripped and then joined to the next with a single space, so the content of non-empty tags will always be separated by one space. I guess the problem is that the textual representation of HTML is not well defined, so I don't think it can be considered a bug, especially since the example in the text() documentation does exactly this:
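To see why the colon and period pick up a leading space, the join can be reproduced on its own. This is only a sketch: the `pieces` list is a hand-written stand-in for the text chunks PyQuery collects from a shortened version of the heading, not output taken from the library.

```python
# Hypothetical text chunks, as they might be collected from:
# <h1><span>trials</span>: <span>experiences</span> with <span>studies</span>.</h1>
pieces = ["trials", ": ", "experiences", " with ", "studies", "."]

# The same expression that text() returns, applied to the stand-in list.
spaced = " ".join([t.strip() for t in pieces if t.strip()])
print(spaced)

# Swapping in an empty separator does not help, because each chunk
# has already been stripped of the spaces that should be kept.
glued = "".join([t.strip() for t in pieces if t.strip()])
print(glued)
```

The first call puts a space in front of ":" and "."; the second loses the spaces between words.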

>>> doc = PyQuery('<div><span>toto</span><span>tata</span></div>')
>>> print(doc.text())
toto tata

If you want another behavior, try implementing your own version of text(). You can use the original version for inspiration since it's only 10 lines or so.
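One possible replacement is sketched below: join all text without a separator, then collapse runs of whitespace, which is roughly how a browser would render it. The helper name `collapsed_text` is made up for this example. It assumes only that the element has an `itertext()` method, as lxml and ElementTree elements do. The demo uses the standard library's ElementTree on a shortened, well-formed version of the heading so that it runs without pyquery installed.

```python
import re
import xml.etree.ElementTree as ET


def collapsed_text(element):
    """Join all text under `element` with no separator, then collapse whitespace."""
    raw = "".join(element.itertext())
    return re.sub(r"\s+", " ", raw).strip()


# Shortened version of the heading from the question (assumed well-formed).
h1 = ET.fromstring(
    '<h1><span class="highlight">trials</span>: '
    '<span class="highlight">experiences</span> with '
    '<span class="highlight">studies</span>.</h1>'
)
print(collapsed_text(h1))

# The example from the text() documentation, with adjacent spans.
div = ET.fromstring("<div><span>toto</span><span>tata</span></div>")
print(collapsed_text(div))
```

With PyQuery, the same helper should work on the underlying lxml element, for example `collapsed_text(doc('h1')[0])`.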

André Laszlo
  • Thanks, good answer. Curious that the pyquery approach is different from how most browsers choose to display text. – Richard Apr 13 '15 at 10:57
  • Yeah it's a bit funny. Note that just using an empty string as the separator will not really work either, since PyQuery strips the content of each tag as well. I think it's a bit of a shortcut. Instead it should join everything without separators and remove consecutive spaces the way HTML is usually rendered. Even the example code above is rendered as "tototata" in most (all?) browsers. – André Laszlo Apr 13 '15 at 11:13